Terry Boo Chee Yee, Li Sheng | P2211194, P2309110 | DAAA/FT/2B/22 | Deep Learning CA2 - Part B (RL)
REINFORCEMENT LEARNING
Overview¶
We have developed a Deep Q-Network (DQN) model that consistently balances the pendulum. The development process began with a basic DQN model, followed by the implementation of exploration versus exploitation strategies. We then transitioned from hard updates to soft updates, reduced the number of episodes, and optimized the tau parameter. Next, we adjusted the number of dense neurons and determined the optimal number of actions. We also fine-tuned the gamma parameter and identified the ideal learning rate. With all these improvements, we successfully created the best DQN model for balancing the pendulum.
We have also explored different reinforcement learning algorithms beyond DQN, including Double DQN, Dueling DQN, and DDPG. For the Double DQN model, we investigated methods such as gradient clipping and prioritized experience replay. In the Dueling DQN model, we explored techniques like the Boltzmann exploration policy.
The final DQN model is evaluated based on its rewards per episode, the moving average of these rewards, and its ability to consistently balance the pendulum using the best weights.
Other Reinforcement Learning Algorithms we explored¶
Double DQN¶
Double DQN is an enhancement to the standard DQN algorithm addressing the overestimation bias in Q-value estimates. It achieves this by decoupling the processes of action selection and action evaluation using two separate networks.
The online network selects the best action, while the target network evaluates the Q-value of that action. This separation reduces the correlation between action selection and evaluation, leading to more accurate Q-value estimates and improved stability and performance during training.
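The decoupling described above can be sketched in a few lines of NumPy. This is an illustrative example with made-up Q-values, not the report's actual training code:

```python
import numpy as np

def double_dqn_target(reward, q_online_next, q_target_next, gamma=0.98, done=False):
    """Compute a Double DQN target: the online network selects the action,
    the target network evaluates it (illustrative sketch)."""
    if done:
        return reward
    best_action = np.argmax(q_online_next)              # selection: online network
    return reward + gamma * q_target_next[best_action]  # evaluation: target network

# Hypothetical next-state Q-values from both networks
q_online = np.array([1.0, 5.0, 2.0])
q_target = np.array([0.5, 3.0, 4.0])
print(double_dqn_target(-1.0, q_online, q_target))  # -1.0 + 0.98 * 3.0 = 1.94
```

Note that the online network's greedy action (index 1) is evaluated with the target network's estimate (3.0), not with the online network's own inflated 5.0; this is exactly what curbs overestimation.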
Dueling DQN¶
Dueling DQN is an enhancement to the standard DQN architecture designed to improve learning efficiency and performance. It achieves this by separately estimating two components: the state value function, which quantifies the long-term benefit of being in a particular state, and the advantage function, which evaluates the relative value of taking specific actions in that state.
By separating these components, Dueling DQN can more efficiently identify valuable states and improve generalization across actions, leading to better performance compared to standard DQN.
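The aggregation of the two streams can be sketched as follows; the mean-subtraction is the standard identifiability trick, and the numbers are illustrative:

```python
import numpy as np

def dueling_q(value, advantages):
    """Combine the state value V(s) and advantages A(s, a) into Q-values,
    subtracting the mean advantage so V and A are identifiable."""
    advantages = np.asarray(advantages, dtype=float)
    return value + (advantages - advantages.mean())

q = dueling_q(value=2.0, advantages=[1.0, 0.0, -1.0])
print(q)  # [3. 2. 1.]
```

In a real dueling network, `value` and `advantages` would be the outputs of two separate dense heads sharing a common feature extractor.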
DDPG¶
Deep Deterministic Policy Gradient (DDPG) is a reinforcement learning algorithm designed for continuous action spaces. It utilizes a deterministic policy that directly outputs specific actions instead of probability distributions.
DDPG uses two neural networks: an actor network for selecting actions and a critic network for evaluating them. The actor-critic framework optimizes both the policy and value functions, with the actor determining actions and the critic estimating the action-value function.
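The actor-critic interplay can be shown with a toy scalar example: a linear actor `a = theta * s` is pushed uphill on a known quadratic critic via the chain rule, which is the core of the deterministic policy gradient (this is a didactic sketch, not the full DDPG algorithm with replay and target networks):

```python
# Toy deterministic policy gradient: the critic Q(s, a) = -(a - 1)^2 is
# assumed known; the actor parameter theta is updated by gradient ascent
# using dQ/dtheta = dQ/da * da/dtheta.
theta, s, lr = 0.0, 1.0, 0.1
for _ in range(200):
    a = theta * s                 # actor: deterministic action
    dq_da = -2.0 * (a - 1.0)      # critic gradient w.r.t. the action
    theta += lr * dq_da * s       # policy gradient ascent step
print(round(theta, 3))  # approaches 1.0, the critic's optimal action
```

In full DDPG the critic gradient `dq_da` comes from backpropagating through the critic network instead of a closed-form formula.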
Performance of Final DQN Model¶
We aim for stable DQN training, as instability can lead to poor performance, convergence issues, and inefficient learning.
- Total rewards increase as the number of episodes rises, improving from around -1650 to approximately -3 by episode 25.
- Successful learning with generally higher rewards over time.
- The model shows strong stability; even during spikes, fluctuations are mild, indicating a steady learning process.
- The model reaches near-optimal rewards (close to 0) within about 8 episodes.
- The moving average shows an upward trend, reflecting consistent improvement and effective strategy learning.
- The model consistently balances the pendulum 100% of the time over 20 trials (Using the best weights), indicating robustness and effective control.
The model demonstrates effective learning and stability, achieving a significant improvement in rewards over episodes. It balances the pendulum consistently across multiple trials, reflecting a robust and reliable policy. Despite some fluctuations due to varying starting positions, the model effectively converges to an optimal solution.
Configuration of Final DQN Model¶
We have used an Exploration vs Exploitation strategy¶
Uses epsilon to explore diverse actions early on and shift to exploiting the learned policy as training progresses, preventing suboptimal policies and stabilizing learning.
We have used soft updates instead of hard updates¶
Gradually updates the target network's weights to prevent drastic changes in Q-values, leading to more stable and smooth convergence compared to hard updates.
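The soft update rule (Polyak averaging) described above amounts to a one-line blend of weights; a minimal NumPy sketch, with illustrative weight arrays:

```python
import numpy as np

def soft_update(target_weights, online_weights, tau=0.01):
    """Move each target weight a small step (tau) toward the online
    weight, instead of copying everything at once (hard update)."""
    return [tau * w + (1.0 - tau) * tw
            for w, tw in zip(online_weights, target_weights)]

target = [np.zeros(3)]
online = [np.ones(3)]
target = soft_update(target, online, tau=0.01)
print(target[0])  # [0.01 0.01 0.01]
```

A hard update corresponds to `tau = 1.0`; the small tau used here keeps target Q-values nearly stationary between steps.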
We have decreased number of episodes¶
Focuses on fewer episodes to improve learning quality and prevent overfitting, as performance stabilizes within the reduced number of episodes.
We have found the optimal value of tau to be 0.01¶
Ensures smooth updates to the target network, reducing abrupt changes and contributing to stable learning.
We have increased the number of dense neurons¶
Enhances model capacity to capture complex patterns in the state-action space, leading to more accurate Q-function approximation and stable learning.
We have found the optimal number of actions to be 5¶
By optimizing the number of discrete actions to 5, we found a balance that provides sufficient granularity for the agent's decisions while maintaining computational efficiency.
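Since Pendulum expects a continuous torque in [-2, 2], a discrete-action DQN needs a mapping from action indices to torques; an evenly spaced grid is a natural choice (sketch, assuming the report used uniform spacing):

```python
import numpy as np

# Discretize the continuous torque range [-2, 2] into 5 actions.
NUM_ACTIONS = 5
torques = np.linspace(-2.0, 2.0, NUM_ACTIONS)
print(torques)  # [-2. -1.  0.  1.  2.]

def action_to_torque(action_index):
    """Map a discrete DQN action index to the torque array passed to env.step()."""
    return np.array([torques[action_index]])
```

With 21 actions the grid step shrinks to 0.2, giving finer control at the cost of a larger output layer and slower exploration.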
We have found the optimal value for Gamma to be 0.98¶
Provides a good balance between immediate and future rewards, supporting stable and consistent training by avoiding extreme short-term or long-term reward biases.
We have found the optimal value of the learning rate to be 0.01¶
Balances the speed and stability of learning, preventing issues from too high or too low rates and aiding in effective convergence.
Optimization Process for Deep Q Network (DQN)¶
Base Model¶
Exploration vs Exploitation¶
Helps balance discovering new strategies and leveraging known ones. Early in training, exploration encourages trying various actions to gather diverse experiences, while over time, decreasing exploration allows focusing on the best-performing strategies. This balance prevents the agent from getting stuck in suboptimal behaviors and supports stable, effective learning.
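A common way to implement this balance is an epsilon-greedy policy with multiplicative decay; a minimal sketch (the decay rate and floor here are illustrative, not necessarily the values used in the report):

```python
import random
import numpy as np

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action (explore),
    otherwise pick argmax Q (exploit)."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore
    return int(np.argmax(q_values))             # exploit

# Multiplicative decay shifts the agent from exploring to exploiting.
epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.995
for episode in range(500):
    epsilon = max(epsilon_min, epsilon * epsilon_decay)
print(round(epsilon, 4))  # decayed well below the starting value of 1.0
```

Early on (epsilon near 1) almost every action is random; late in training the agent acts almost entirely on its learned Q-values.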
Using Soft updates instead of hard updates¶
We use soft updates to provide a smoother transition of weights from the main network to the target network, reducing sudden changes that can destabilize learning. Additionally, by gradually updating the target network, the model can adapt more quickly to new information, potentially leading to faster overall convergence.
Decrease the number of Episodes¶
The previous model improved in the first 20 episodes but showed limited gains with rewards fluctuating between -500 and -100 afterward. To enhance stability and prevent overfitting, we reduced the number of episodes to 25. By doing this, we aim to focus the model on achieving optimal performance and stability, avoiding unnecessary exploration that can lead to instability and fluctuations.
Finding the optimal value for Tau¶
Previously, we used a soft update for the target model with a tau value of 0.01. Tau controls the rate of updates, with smaller values ensuring stability through gradual changes in Q-values, while larger values can cause oscillations and hinder convergence. To evaluate the impact on stability and performance, we will test tau values of 0.005, 0.01, and 0.05.
Increase the number of Dense Neurons (128)¶
We increased the number of dense neurons to stabilize training. More neurons capture complex relationships better, improving Q-function approximation and resulting in more stable learning.
Finding the optimal number of actions¶
We will test different action numbers (3, 5, 21) to find the most stable training. A larger action space allows more exploration but can cause instability, while a smaller space simplifies learning but may limit optimal policy discovery.
Finding out the optimal value for Gamma¶
We will test gamma values [0.9, 0.94, 0.98] to find the best balance between considering future rewards and training stability. Our goal is to stabilize DQN training by evaluating high gamma values.
Finding out the optimal value for learning rate¶
We will test learning rates [0.001, 0.01, 0.05] to find the best balance between rapid learning and stability in DQN training. Our goal is to stabilize training and ensure efficient learning without causing instability, starting from an initial learning rate of 0.01.
Best Model for DQN¶
Best Model
Note: Stability improves the most when the exploration vs exploitation strategy is implemented, as initially high epsilon values encourage exploration, preventing early convergence to suboptimal policies. Gradually decreasing epsilon increases exploitation, allowing the agent to rely on learned Q-values, leading to more stable and optimal policy convergence.
Optimization Process for Double DQN¶
Base Model¶
Gradient Clipping¶
Gradient Clipping limits the magnitude of gradients during backpropagation to prevent excessively large updates. This technique improves training stability, especially in RL where large or exploding gradients can lead to unstable learning.
By controlling gradient magnitudes, clipping ensures more stable and steady updates, reduces oscillations, and helps achieve faster, more reliable convergence.
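Clipping by the global norm preserves the gradient's direction while capping its magnitude; a NumPy sketch with a hypothetical two-element gradient:

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients so their combined L2 norm never exceeds max_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if global_norm > max_norm:
        grads = [g * (max_norm / global_norm) for g in grads]
    return grads

grads = [np.array([3.0, 4.0])]          # global norm 5.0, above the cap
clipped = clip_by_global_norm(grads, max_norm=1.0)
print(clipped[0])  # [0.6 0.8] -- direction preserved, norm scaled to 1
```

In Keras the same effect is available via the optimizer's `clipnorm`/`clipvalue` arguments, and in PyTorch via `torch.nn.utils.clip_grad_norm_`.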
Prioritized Experience Replay¶
Prioritized Memory improves experience replay by sampling transitions based on their importance. Higher-priority transitions are sampled more often, which enhances learning efficiency, exploration, and reduces overestimation bias. In Double DQN, it accelerates convergence and stabilizes training by focusing on critical experiences.
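The core of proportional prioritized replay is turning TD errors into sampling probabilities; a sketch (alpha and epsilon values are the common defaults, used here for illustration):

```python
import numpy as np

def sampling_probabilities(td_errors, alpha=0.6, eps=1e-6):
    """Proportional prioritization: p_i = (|delta_i| + eps)^alpha,
    normalized so probabilities sum to 1. alpha=0 recovers uniform replay."""
    priorities = (np.abs(td_errors) + eps) ** alpha
    return priorities / priorities.sum()

probs = sampling_probabilities(np.array([0.1, 2.0, 0.5]))
print(probs.argmax())  # 1 -- the transition with the largest TD error dominates
```

A full implementation would also apply importance-sampling weights to correct for the non-uniform sampling; that correction is omitted here for brevity.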
Optimization Process for Dueling DQN¶
Base Model¶
Weight Decay¶
We introduced weight decay to the Adam optimizer to prevent overfitting and stabilize training. This regularization technique helps by penalizing large weights, which improves generalization and maintains a more controlled training process.
Boltzmann Exploration Policy¶
The Boltzmann exploration policy uses a softmax function to turn action values into probabilities, with a temperature parameter balancing exploration and exploitation. This adaptive approach stabilizes learning by adjusting exploration based on Q-values, unlike the fixed exploration rate in ε-greedy policies.
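A minimal, numerically stabilized sketch of the policy, with illustrative Q-values and temperatures:

```python
import numpy as np

def boltzmann_probs(q_values, temperature=1.0):
    """Softmax over Q-values. High temperature flattens the distribution
    (more exploration); low temperature sharpens it (more exploitation)."""
    z = (q_values - np.max(q_values)) / temperature  # max-shift for stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

q = np.array([1.0, 2.0, 3.0])
hot = boltzmann_probs(q, temperature=10.0)   # near-uniform
cold = boltzmann_probs(q, temperature=0.1)   # nearly greedy
print(hot.round(3), cold.round(3))
```

An action is then drawn from these probabilities, e.g. with `np.random.choice(len(q), p=probs)`, so even low-value actions retain a small chance of being explored.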
Deep Deterministic Policy Gradient (DDPG)¶
The DDPG Model¶
Deterministic Policy Gradient (DPG) is a reinforcement learning algorithm tailored for continuous action spaces, such as those found in the pendulum problem (Pendulum-v0). Instead of generating probability distributions over actions like stochastic policies, DPG utilizes a deterministic policy that directly outputs specific actions. This method generally involves deep neural networks to approximate both the policy function and the action-value function. By leveraging the Deterministic Policy Gradient theorem, the policy parameters are updated to increase the expected action-value.
Table of Contents¶
- 1. Background Research
- 2. Task Description
- 3. Set Up
- 4. Test Environment
- 5. Evaluation Metrics
- 6. Deep Q Network
- 6.1 Base Model
- 6.2 Exploration vs Exploitation
- 6.3 Using Soft updates instead of hard updates
- 6.4 Decrease the number of Episodes
- 6.5 Finding the optimal value for Tau
- 6.6 Increase the number of Dense Neurons
- 6.7 Finding the optimal number of actions
- 6.8 Finding out the opimal value for Gamma
- 6.9 Finding out the optimal value for learning rate
- 6.11 Best Model for DQN
- 6.12 Running the Best DQN Weights Multiple Times (Test the Best DQN Model)
- 7. Double DQN
- 8. Dueling DQN
- 9. Deep Deterministic Policy Gradient (DDPG)
- 10. Final Model
- 11. Final Model Evaluation
- 12. Conclusions
Background Research ¶
Open AI Gym¶
What is Open AI Gym¶
OpenAI Gym is an open-source toolkit designed for developing and comparing reinforcement learning (RL) algorithms. It provides a wide variety of environments that simulate real-world tasks, allowing researchers and developers to test and benchmark their RL agents.
OpenAI Gym offers numerous environments, including classic control tasks (e.g., CartPole, MountainCar), Atari games, and robotics simulations. This diversity enables experimentation across different types of challenges.
OpenAI Gym provides a consistent API for interacting with environments, making it easier to implement and evaluate different algorithms.
How Open AI Gym works¶
OpenAI Gym works by providing a standardized interface for interacting with various reinforcement learning (RL) environments. Users can create an environment by specifying its name (e.g., CartPole-v1). Each environment simulates a specific task that an RL agent can learn to perform.
The environment follows a consistent API, including key methods such as reset(), which resets the environment to an initial state and returns the initial observation, and step(action), which takes an action as input, updates the environment's state, and returns the new observation, reward, done flag, and additional info. The RL agent interacts with the environment by receiving observations, choosing actions based on its policy, and sending actions back through the step() method. The agent learns from feedback, using rewards to improve its policy through algorithms like Q-learning or policy gradients.
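The reset/step contract described above can be demonstrated without installing gym by writing a tiny stand-in environment; `ToyEnv` below is a hypothetical class mimicking the classic Gym API (4-tuple return), not part of gym itself:

```python
import random

class ToyEnv:
    """Minimal stand-in mimicking the classic Gym API (reset/step)."""
    def __init__(self, horizon=5):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return 0.0                        # initial observation

    def step(self, action):
        self.t += 1
        obs = float(self.t)
        reward = -abs(action)             # toy reward signal
        done = self.t >= self.horizon     # episode truncation
        return obs, reward, done, {}      # classic 4-tuple: obs, reward, done, info

env = ToyEnv()
obs, done = env.reset(), False
while not done:                           # the standard agent-environment loop
    action = random.choice([-1, 0, 1])    # placeholder policy
    obs, reward, done, info = env.step(action)
print(env.t)  # 5 -- one full episode was run
```

The loop is identical for a real environment: only `gym.make('Pendulum-v0')` would replace `ToyEnv()`, with the agent's policy replacing the random choice.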
What is Reinforcement Learning¶
Reinforcement Learning (RL) is a machine learning technique that enables an agent to learn in an interactive environment by trial and error, using feedback from its own actions and experiences.
RL algorithms use a reward-and-punishment paradigm as they process data. They learn from the feedback of each action and self-discover the best processing paths to achieve final outcomes.
Main Components of Reinforcement Learning¶
Agent¶
The agent is the entity that makes decisions. It receives observations from the environment, takes actions based on a policy, and learns from the consequences of these actions. The primary goal of the agent is to maximize the cumulative reward over time.
Environment¶
The environment is where the agent operates. It can be a physical environment, a simulated environment, or a combination of both. The environment provides feedback to the agent in the form of rewards or penalties based on the agent’s actions.
State¶
The state represents the current situation of the agent in the environment. It can include relevant information such as the agent’s location, the presence of obstacles, or any other factors that might impact the agent’s decision-making process.
Action¶
Decisions made by the agent based on the current state are referred to as actions. The agent selects an action from a set of possible actions, which then affects the state of the environment and potentially leads to rewards or penalties.
Reward¶
Rewards are the positive or negative feedback that the agent receives from the environment based on its actions. The goal of reinforcement learning is to maximize the cumulative reward over time, i.e., to find an optimal policy that leads to the highest possible long-term reward.
Policy¶
The policy is a strategy that the agent uses to determine actions based on the current state. It can be represented as a function mapping states to actions. Policies can be deterministic (fixed action for each state) or stochastic (probabilistic distribution over actions).
Applications of Reinforcement Learning¶
Robotics¶
Reinforcement Learning allows robots to learn to pick up and manipulate objects through trial and error. By exploring various gripping techniques and receiving feedback (e.g., whether the object was successfully grasped or dropped), robots optimize their actions. For example, a robot may learn to adjust its grip strength based on the weight and fragility of different objects, leading to more effective handling in real-world scenarios.
Autonomous Vehicles¶
In autonomous vehicles, Reinforcement Learning algorithms evaluate multiple potential routes and their consequences (like travel time, fuel efficiency, and safety) to determine the most optimal path. The vehicle learns from past journeys and real-time data (e.g., traffic conditions and road hazards) to adapt its driving strategy dynamically. This enables better decision-making during navigation and enhances overall driving performance.
Gaming and Simulations¶
Reinforcement Learning agents learn to excel in complex games by engaging in numerous simulated games against themselves or other agents. They develop strategies by exploring different moves, receiving rewards for winning or achieving objectives, and adjusting their tactics accordingly. For instance, in games like Dota 2 or StarCraft II, agents analyze the consequences of actions to refine their strategies, often surpassing human capabilities through extensive training and optimization.
Finance and Trading¶
Reinforcement Learning algorithms create models that continuously learn from historical and real-time market data. By simulating different trading scenarios and evaluating outcomes, these models adapt their strategies to changing market conditions
Chat Bots¶
Reinforcement Learning enhances chatbots by enabling them to learn from user interactions over time. As they engage in conversations, they receive feedback on the quality of their responses (e.g., user satisfaction or escalation of issues). This feedback loop allows chatbots to adjust their conversational strategies, improving their ability to provide relevant and context-aware responses, leading to more natural and effective interactions with users.
How is Reinforcement Learning Used in The Industry¶
Manufacturing¶
Reinforcement Learning is utilized to train robots in manufacturing environments to perform repetitive tasks such as assembly, welding, and packaging. Robots learn from their environment by experimenting with different approaches, receiving feedback on their performance (e.g., successful assembly or errors), and adjusting their actions accordingly.
By optimizing task execution, robots can significantly reduce production time and costs.
Gaming and Entertainment¶
Reinforcement Learning is employed to create non-player characters (NPCs) that adapt their behavior based on player actions. NPCs learn from player strategies and adjust their tactics, making the gameplay experience more dynamic and challenging.
This allows players to enjoy a more engaging experience as NPCs respond realistically to their actions, leading to greater satisfaction and retention.
Automotive¶
Reinforcement Learning automates critical functions in autonomous vehicles, including navigation, decision-making, and obstacle avoidance. Vehicles learn from their surroundings and past experiences to improve driving strategies under various conditions (e.g., urban environments, highway driving).
By continuously learning and adapting to real-world scenarios, autonomous vehicles can enhance safety, reducing the likelihood of accidents.
Healthcare¶
Reinforcement Learning powers virtual health assistants that guide patients through treatment plans, helping them remember to take medications, attend appointments, and follow health guidelines based on learned patient behaviors.
Tailored reminders and recommendations improve patient adherence to treatment plans, leading to better health outcomes.
References
https://aws.amazon.com/what-is/reinforcement-learning/
https://towardsdatascience.com/reinforcement-learning-101-e24b50e1d292
https://arxiv.org/pdf/1811.12560
https://towardsdatascience.com/getting-started-with-openai-gym-d2ac911f5cbc
https://medium.com/@digitaldadababu/reinforcement-learning-the-ultimate-master-guide-cf3a9e0cb6ed
Task Description ¶
The inverted pendulum swingup problem is based on the classic problem in control theory. The system consists of a pendulum attached at one end to a fixed point, and the other end being free. The pendulum starts in a random position and the goal is to apply torque on the free end to swing it into an upright position, with its center of gravity right above the fixed point.
The diagram below specifies the coordinate system used for the implementation of the pendulum's dynamic equations.
- x-y: cartesian coordinates of the pendulum's end in meters.
- theta: angle in radians.
- tau: torque in N m. Defined as positive counter-clockwise.
Action Space¶
The action is a ndarray with shape (1,) representing the torque applied to the free end of the pendulum.
| Num | Action | Min | Max |
|---|---|---|---|
| 0 | Torque | -2.0 | 2.0 |
Observation Space¶
The observation is a ndarray with shape (3,) representing the x-y coordinates of the pendulum's free end and its angular velocity.
| Num | Observation | Min | Max |
|-----|------------------|------|-----|
| 0 | x = cos(theta) | -1.0 | 1.0 |
| 1 | y = sin(theta) | -1.0 | 1.0 |
| 2 | Angular Velocity | -8.0 | 8.0 |
Rewards¶
The reward function is defined as:
$r = -(\theta^2 + 0.1\,\dot{\theta}^2 + 0.001\,\tau^2)$
where $\theta$ is the pendulum's angle normalized between [-pi, pi] (with 0 being in the upright position).
Based on the above equation, the minimum reward that can be obtained is $-(\pi^2 + 0.1 \cdot 8^2 + 0.001 \cdot 2^2) = -16.2736044$, while the maximum reward is zero (pendulum is upright with zero velocity and no torque applied).
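The reward bounds can be verified directly by evaluating the formula at its best and worst cases:

```python
import math

def pendulum_reward(theta, theta_dt, torque):
    """Pendulum reward: r = -(theta^2 + 0.1*theta_dt^2 + 0.001*torque^2).
    Zero when the pendulum is upright, still, and unforced."""
    return -(theta ** 2 + 0.1 * theta_dt ** 2 + 0.001 * torque ** 2)

print(pendulum_reward(0.0, 0.0, 0.0))                 # 0.0  (best case)
print(round(pendulum_reward(math.pi, 8.0, 2.0), 7))   # -16.2736044 (worst case)
```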
Starting State¶
The starting state is a random angle in [-pi, pi] and a random angular velocity in [-1, 1].
Episode Truncation¶
The episode truncates at 200 time steps.
Arguments¶
- g: acceleration of gravity measured in (m/s²) used to calculate the pendulum dynamics. The default value is g = 10.0. Example: gym.make('Pendulum-v0', g=9.81)
Approaches taken to solve this problem¶
DQN (Deep Q-Network)¶
For the inverted pendulum swing-up problem, DQN uses a neural network to approximate the Q-value function, representing the expected future rewards for applying different torques at various states. The network takes the current state as input and outputs Q-values for discrete torque actions.
Double DQN¶
Double DQN addresses the overestimation bias in DQN, which is particularly important for the pendulum problem where precise torque control is needed, by using two separate networks for action selection and evaluation.
Dueling DQN¶
Dueling DQN enhances learning for the pendulum problem by decomposing the Q-value function into:
- State Value Stream: Estimates the value of being in a particular state, regardless of the action.
- Advantage Stream: Estimates the advantage of each torque action given the current state.
DDPG (Deep Deterministic Policy Gradient)¶
DDPG is well-suited for the pendulum problem due to its capability to handle continuous action spaces. It uses two networks:
- Actor Network: Outputs continuous torque values directly, providing precise control needed to swing up and balance the pendulum.
- Critic Network: Estimates the Q-value of the state-torque pair, guiding the actor's learning.
DDPG's deterministic policy and ability to output continuous torques make it highly effective for solving the inverted pendulum swing-up problem.
References
https://www.gymlibrary.dev/environments/classic_control/pendulum/
Set up ¶
Installing Required Libraries ¶
!pip install gym==0.17.3
!pip install matplotlib
!pip install tensorflow
!pip install torch torchvision torchaudio
Importing Libraries ¶
# Standard Libraries
import random
import os
import warnings
from collections import deque, namedtuple
# Numerical and Data Handling Libraries
import numpy as np
# Plotting and Visualization
import matplotlib.pyplot as plt
import imageio
from IPython.display import HTML, display
from PIL import Image
# Reinforcement Learning Environment
import gym
# Deep Learning Frameworks
# TensorFlow
import tensorflow as tf
from tensorflow.keras import Model
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Input
from tensorflow.keras.optimizers import Adam
# PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
# Ignore warnings
warnings.filterwarnings("ignore")
Set up the Environment ¶
env = gym.make('Pendulum-v0')
Test Environment ¶
Environment Information ¶
num_states = env.observation_space.shape[0]
print("Size of State Space -> {}".format(num_states))
num_actions = env.action_space.shape[0]
print("Size of Action Space -> {}".format(num_actions))
upper_bound = env.action_space.high[0]
lower_bound = env.action_space.low[0]
print("Max Value of Action -> {}".format(upper_bound))
print("Min Value of Action -> {}".format(lower_bound))
Size of State Space -> 3 Size of Action Space -> 1 Max Value of Action -> 2.0 Min Value of Action -> -2.0
Check understanding of the state and action spaces¶
- The state space of the Pendulum environment has 3 dimensions, meaning the agent perceives its state as 3 distinct values (cos θ, sin θ, and angular velocity).
- The action space has 1 dimension, indicating that the agent can only take a single action at a time.
- The maximum and minimum range of possible action values in the action space that can be taken are 2.0 and -2.0, respectively.
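Since a DQN selects among discrete actions while Pendulum expects a continuous torque in [-2.0, 2.0], one common workaround is to discretize the torque range into evenly spaced bins. A minimal sketch (the bin count of 5 here is purely illustrative):

```python
# Map a discrete action index to a continuous torque in [-2.0, 2.0].
# num_actions = 5 is an illustrative choice, not the value used for training.
num_actions = 5
action_scope = [i * 4.0 / (num_actions - 1) - 2.0 for i in range(num_actions)]
print(action_scope)  # [-2.0, -1.0, 0.0, 1.0, 2.0]
```

The agent then picks an index with its policy and passes `[action_scope[index]]` to `env.step`.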
Trial Run ¶
episodes = 1
for episode in range(1, episodes + 1):
    state = env.reset()
    done = False
    score = 0
    while not done:
        action = [random.uniform(-2, 2)]
        observation, reward, done, info = env.step(action)
        score += reward
        env.render()
    print(f"Episode {episode}, Score: {score}")
env.close()
Episode 1, Score: -1488.7194405167888
What are we doing¶
This code runs a single episode in a reinforcement learning environment by resetting the environment, then repeatedly taking randomly generated actions within a specified range until the episode ends. After the episode completes, it prints the total score achieved and finally closes the environment to clean up resources.
The loop helps in testing the interaction with the environment by running a simple episode using random actions and observing how the agent performs in that environment.
Evaluation Metrics ¶
Rewards ¶
What this graph does¶
The plot_rewards function generates a line plot that displays the total reward accumulated by the agent per episode. This graph is crucial for understanding the learning progress of the RL model, as it shows how the agent's performance evolves over time.
An increasing trend in total reward indicates that the agent is learning effectively and improving its decision-making strategy.
Why is it a good evaluation¶
The plot_rewards_subplots function extends this analysis by comparing rewards across different configurations of the RL model. It creates subplots for each configuration, allowing for side-by-side comparison of how varying certain parameters impacts the agent's performance.
This is particularly useful for hyperparameter tuning, as it helps identify which parameter values lead to better rewards and thus more effective learning.
Overall¶
Overall, these graphs are valuable for evaluating the RL model's performance, as they provide insights into the model's ability to maximize rewards over time and across different settings. Analyzing these plots can help us make informed decisions about further training, parameter adjustments, and model improvements, ultimately leading to a more robust and efficient RL agent.
def plot_rewards(rewards):
    plt.figure(figsize=(10, 5))
    plt.plot(rewards, label='Reward per Episode')
    plt.xlabel('Episode')
    plt.ylabel('Total Reward')
    plt.title('Total Reward per Episode')
    plt.legend()
    plt.grid(True)
    plt.show()

def plot_rewards_subplots(rewards_list, parameter_values, parameter_name):
    """Plot rewards for different configurations as subplots with descriptive titles."""
    num_plots = len(rewards_list)
    fig, axes = plt.subplots(num_plots, 1, figsize=(10, 2 * num_plots))
    for i, (rewards, value) in enumerate(zip(rewards_list, parameter_values)):
        if num_plots == 1:
            ax = axes
        else:
            ax = axes[i]
        ax.plot(rewards)
        ax.set_title(f'Rewards for {parameter_name} = {value}')
        ax.set_xlabel('Episode')
        ax.set_ylabel('Total Reward')
        ax.grid(True)
    plt.tight_layout()
    plt.show()
Moving Average ¶
What this graph does¶
The compute_moving_average_and_plot and compute_moving_average_and_plot_subplots functions are designed to enhance the analysis of your Reinforcement Learning (RL) model's performance by smoothing out the reward data over episodes.
The moving average graph helps to reduce noise and highlight trends in the reward data, providing a clearer view of the agent's learning progress. By computing and plotting the moving average with a specified window size, these functions show how rewards evolve over time and help identify whether the agent's performance is stabilizing or improving.
Why is it a good evaluation¶
The moving average plot is a valuable evaluation tool because it helps to smooth out fluctuations in reward values that can be caused by the inherent variability in RL training.
This makes it easier to discern meaningful patterns and trends, such as whether the agent is consistently achieving higher rewards or if there are specific episodes where performance improves significantly. Additionally, the compute_moving_average_and_plot_subplots function allows for a comparative analysis across different configurations or parameters, offering insights into which settings lead to more stable or improved performance.
Overall¶
Overall, these graphs aid in assessing the effectiveness of the RL model's learning process and guide adjustments for better performance.
def compute_moving_average_and_plot(results, window_size=5):
    """Compute the moving average of a list of data and plot it."""
    # Compute the moving average
    moving_avg = np.convolve(results, np.ones(window_size) / window_size, mode='valid')
    # Plot the moving average
    plt.figure(figsize=(12, 6))
    plt.plot(np.arange(window_size - 1, len(results)), moving_avg, color='orange', label='Moving Average')
    plt.xlabel('Episode')
    plt.ylabel('Reward')
    plt.title('Moving Average')
    plt.legend()
    plt.grid(True)
    plt.show()

def compute_moving_average_and_plot_subplots(results_list, values, title, window_size=5):
    """Create subplots to compute and plot moving averages for different result lists with customizable titles."""
    num_plots = len(results_list)
    fig, axes = plt.subplots(num_plots, 1, figsize=(12, 2 * num_plots))
    if num_plots == 1:
        axes = [axes]  # Ensure axes is a list if there's only one subplot
    for i, (results, value) in enumerate(zip(results_list, values)):
        # Compute the moving average
        moving_avg = np.convolve(results, np.ones(window_size) / window_size, mode='valid')
        # Plot the moving average on the given axis
        axes[i].plot(np.arange(window_size - 1, len(results)), moving_avg, color='orange', label='Moving Average')
        axes[i].set_xlabel('Episode')
        axes[i].set_ylabel('Reward')
        axes[i].set_title(f'Moving Average for {title} = {value}')  # Use the provided title and value
        axes[i].legend()
        axes[i].grid(True)
    plt.tight_layout()
    plt.show()
Display environment ¶
What this graph does¶
The display_gifs_in_grid function is designed to visualize the output of the Reinforcement Learning (RL) model by displaying GIFs that illustrate the agent's performance over episodes.
It provides a broad view by arranging multiple GIFs in a grid layout.
Why is it a good evaluation¶
This can be useful for examining detailed agent behavior or performance at a specific point in training. By visually inspecting these GIFs, we can assess how the agent interacts with its environment, detect any anomalies or improvements, and get a clearer understanding of its decision-making process.
In addition, by placing the GIFs in a grid, you can easily spot trends, such as consistent patterns of improvement or areas where the agent might be struggling.
Overall¶
Overall, these visualizations are an excellent evaluation tool for your RL model because they offer a direct, visual representation of the agent's behavior and performance. This helps in understanding how well the agent is learning and adapting over time, identifying areas for improvement, and making informed adjustments to training strategies.
Multiple GIF¶
def display_gifs_in_grid(directory):
    # List of GIF filenames
    gif_files = [f for f in os.listdir(directory) if f.endswith('.gif')]
    # Limit the number of GIFs to 20
    gif_files = gif_files[:20]
    # Define grid parameters
    num_cols = 4      # Number of columns
    gif_width = 200   # Width of each GIF
    gif_height = 200  # Height of each GIF
    # Create HTML content with CSS for displaying GIFs in a grid
    html_content = f'''
    <style>
    .gif-grid {{
        display: grid;
        grid-template-columns: repeat({num_cols}, {gif_width}px);
        gap: 10px;
    }}
    .gif-grid img {{
        width: {gif_width}px;
        height: {gif_height}px;
        object-fit: contain;
    }}
    </style>
    <div class="gif-grid">
    '''
    for gif_file in gif_files:
        gif_path = os.path.join(directory, gif_file)
        html_content += f'<img src="{gif_path}">'
    html_content += '</div>'
    # Display the HTML content
    display(HTML(html_content))
Deep Q Network (DQN) ¶
What is Q Learning¶
Q-learning is a model-free, value-based, off-policy algorithm that will find the best series of actions based on the agent's current state. The “Q” stands for quality. Quality represents how valuable the action is in maximizing future rewards.
Components of Q Learning¶
States¶
The environment's different situations or configurations that the agent can be in.
Actions¶
The set of all possible moves or decisions the agent can take in each state.
Rewards¶
The feedback the agent receives from the environment after performing an action in a particular state.
Q-Values¶
A Q-value represents the expected future reward for taking a given action in a given state and then following the optimal policy thereafter.
The goal of Q-learning is to learn these Q-values for all state-action pairs.
Discount Factor¶
This parameter determines the importance of future rewards.
How Q Learning Works¶
The Q-learning algorithm works by initializing the Q-values for all state-action pairs arbitrarily. Then, it repeatedly interacts with the environment through episodes, where in each episode, the agent observes its current state and selects an action based on an exploration-exploitation strategy. One common strategy is the ε-greedy policy, which with a certain probability ε, chooses a random action (exploration), and with probability 1-ε, chooses the action with the highest Q-value for the current state (exploitation).
The agent performs the chosen action, observes the resulting reward and the next state, and then updates the Q-value for the state-action pair using a specific formula. This formula incorporates the learning rate (α), which determines how much the new information affects the old Q-value, and the discount factor (γ), which determines the importance of future rewards.
Through repeated application of this process, the Q-values converge to the true values, enabling the agent to act optimally by always choosing the action with the highest Q-value in each state.
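The update described above can be written as Q(s, a) ← Q(s, a) + α · (r + γ · max over a′ of Q(s′, a′) − Q(s, a)). A minimal tabular sketch, where the state/action counts and the sample transition are purely illustrative:

```python
import numpy as np

alpha, gamma = 0.1, 0.9   # learning rate and discount factor
Q = np.zeros((4, 2))      # toy Q-table: 4 states, 2 actions (illustrative sizes)

def q_update(s, a, r, s_next):
    # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] += alpha * (td_target - Q[s, a])

# One update from a zero-initialized table: the new Q-value is alpha * r
q_update(s=0, a=1, r=1.0, s_next=2)
print(Q[0, 1])  # 0.1
```

Repeating this update over many transitions is what drives the Q-values toward their true values.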
What is DQN¶
DQN, or Deep Q-Network, is a deep reinforcement learning algorithm that combines Q-learning with deep neural networks to approximate the action-value (Q) function in complex environments.
It is usually used in conjunction with Experience Replay, for storing the episode steps in memory for off-policy learning, where samples are drawn from the replay memory at random.
Main Components of DQN¶
Agent¶
The agent interacts with the environment by observing states and taking actions.
It uses a deep neural network (DNN) to approximate Q-values, which represent the expected future reward for each action in a given state.
Through experience replay and Q-learning updates, the agent learns to improve its policy over time.
Environment¶
The environment defines the task and provides feedback to the agent based on its actions.
It has a state space that represents the current situation or configuration and an action space that defines the possible actions the agent can take.
Actions taken by the agent influence the environment's state and yield rewards that the agent seeks to maximize.
Deep Neural Network (DNN)¶
The DNN in DQN acts as a function approximator for Q-values, estimating the expected future reward for each action in a given state.
It takes the environment's state as input and outputs a Q-value for each possible action. This allows DQNs to handle complex, high-dimensional state spaces like images.
Experience Replay¶
Experience replay is a technique where the agent's experiences (state, action, reward, next state, and episode termination) are stored in a replay buffer.
During training, the agent samples random minibatches of experiences from this buffer. This breaks the sequential correlation of experiences and enhances learning efficiency.
Target Network¶
To stabilize training, DQNs utilize two separate networks: the primary network (learning Q-values) and the target network.
The target network has the same architecture as the primary network but its weights are updated less frequently. It computes the target Q-values used in the loss function during training.
Periodically, the weights of the primary network are copied to the target network to update its parameters. This delayed update helps in stabilizing the learning process by providing a fixed target for a period.
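As a hedged sketch of the two update styles (this notebook's base model uses hard updates; the overview mentions later moving to soft updates), using plain NumPy arrays as stand-ins for network weights:

```python
import numpy as np

online_w = np.array([1.0, 2.0])   # stand-in for the primary network's weights
target_w = np.array([0.0, 0.0])   # stand-in for the target network's weights

# Hard update: copy the online weights wholesale every N episodes.
target_w_hard = online_w.copy()

# Soft update: blend in a small fraction tau of the online weights each step.
tau = 0.01  # illustrative value
target_w_soft = tau * online_w + (1.0 - tau) * target_w
print(target_w_soft)  # [0.01 0.02]
```

The soft variant changes the target network gradually every step, which tends to smooth the moving target the loss is computed against.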
Q-Learning Update Rule¶
The Q-learning update rule dictates how the DNN learns from experience. It involves computing the difference between the predicted Q-value and a target Q-value for each encountered state-action pair.
Epsilon-Greedy Policy¶
DQNs employ an epsilon-greedy policy to balance exploration (trying new actions) and exploitation (selecting actions based on current knowledge).
At each step, the agent selects a random action with probability epsilon (exploration) or the action with the highest predicted Q-value from the DNN with probability 1-epsilon (exploitation).
Epsilon starts high to encourage exploration early on and gradually decays over time to prioritize exploitation as the agent gains experience.
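A minimal epsilon-greedy sketch with decay; the decay rate and floor below are illustrative assumptions, not values taken from this notebook:

```python
import random

epsilon, epsilon_min, epsilon_decay = 1.0, 0.05, 0.995  # illustrative schedule

def select_action(q_values):
    global epsilon
    if random.random() < epsilon:
        # explore: pick a uniformly random action index
        action = random.randrange(len(q_values))
    else:
        # exploit: pick the action with the highest predicted Q-value
        action = max(range(len(q_values)), key=lambda a: q_values[a])
    # decay epsilon toward its floor so exploitation dominates later
    epsilon = max(epsilon_min, epsilon * epsilon_decay)
    return action

a = select_action([0.1, 0.7, 0.3])
```

Each call returns a valid action index and nudges epsilon downward, shifting the balance from exploration to exploitation over training.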
How DQN Works¶
Deep Q-Networks (DQN) work by combining Q-learning with deep neural networks to approximate the Q-value function, which represents the expected future rewards for taking a specific action in a given state. In DQN, a deep neural network called the policy network approximates this Q-value function, taking states as inputs and outputting Q-values for all possible actions. A separate target network, which is a delayed copy of the policy network, is used to provide stable target values during training.
How are DQN trained¶
The training begins by initializing a replay memory to store experiences, where each experience consists of a state, action, reward, and the next state. Two neural networks are initialized: a main network and a target network, which are identical in architecture but serve different roles. The agent interacts with the environment, selecting actions based on an exploration-exploitation strategy, typically ε-greedy, and stores the resulting experiences in the replay memory. During training, a mini-batch of experiences is randomly sampled from the replay memory to break the correlation between consecutive experiences and improve stability. For each experience in the mini-batch, target Q-values are computed using the target network based on the Bellman equation. The policy network is updated by minimizing the loss between the predicted Q-values from the policy network and the target Q-values from the target network. This update is performed using optimization algorithms like stochastic gradient descent. The target network is periodically updated to match the policy network, either by directly copying the weights or using a soft update mechanism. The training loop continues over many episodes, allowing the agent to learn a policy that maximizes cumulative rewards.
Difference Between Q Learning and DQN¶
Learning¶
Q-learning is a model-free reinforcement learning algorithm. It learns an optimal action-value function by iteratively updating Q-values based on experienced transitions. Q-learning typically operates with a discrete and small state and action space and directly updates a table (Q-table) that stores Q-values for all state-action pairs.
DQN extends Q-learning to handle environments with high-dimensional state spaces, such as raw pixel inputs from video games. It uses a deep neural network (DNN) to approximate the Q-function. DQN employs deep learning techniques to approximate Q-values and handle continuous state spaces effectively.
Action selection¶
In Q-learning, the agent begins by maintaining a Q-table, which serves as a repository for Q-values corresponding to every state-action pair it encounters during its interactions with the environment. When faced with a decision point, the agent first observes the current state it is in. It then consults its Q-table to retrieve the Q-values associated with all possible actions available in that particular state.
DQN replaces the Q-table with a deep neural network (DNN). This network takes the current state as input and outputs Q-values for all possible actions. During action selection, DQN computes Q-values using the neural network and selects the action with the highest Q-value (exploitation).
What will we be aiming for DQN¶
We will be aiming for stable training of the DQN, as instability can lead to poor performance, convergence issues, and inefficient learning.
References
https://www.datacamp.com/tutorial/introduction-q-learning-beginner-tutorial
https://paperswithcode.com/method/dqn
https://www.geeksforgeeks.org/q-learning-in-python/
https://www.baeldung.com/cs/q-learning-vs-deep-q-learning-vs-deep-q-network
def save_model_weights(agent, model_dir):
    if not os.path.exists(model_dir):
        os.makedirs(model_dir)
    weights_path = os.path.join(model_dir, 'tensorflow_dqn_weights.h5')
    agent.model.save_weights(weights_path)
Base Model ¶
Memory Class¶
This class is a critical component of experience replay in DQN. It stores transition tuples of (state, action, reward, next_state), which helps break the temporal correlation between consecutive experiences and improves learning stability. The update method adds transitions to the memory, and the sample method retrieves a random batch for training.
Net Class¶
This defines the neural network architecture used to approximate the Q-value function. The model consists of two hidden layers with ReLU activation and an output layer with a linear activation function, suitable for estimating continuous Q-values for each action.
Agent Class¶
This class encapsulates the DQN agent's functionality, including policy action selection (epsilon-greedy), experience storage, model training (learn method), and target model updating. The select_action method balances exploration and exploitation, while the learn method updates the model by minimizing the mean squared error between predicted and target Q-values. The update_target_model method periodically updates the target network to stabilize training.
Main Function¶
The main loop runs episodes in the environment, collects rewards, stores experiences, and trains the model. It also manages the rendering of the environment for visualization and saves GIFs of training progress.
class Memory:
    # Constructor; capacity is the maximum size of the memory. Once capacity is reached, the oldest memories are removed.
    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    '''
    Adds a transition to the memory. The transition is a tuple of (state, action, reward, next_state).
    If the maximum capacity of the memory is reached, the oldest memories are overwritten.
    '''
    def update(self, transition):
        self.memory.append(transition)

    '''
    Retrieve a random sample from memory; batch_size indicates the number of samples to retrieve.
    '''
    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)
class Net(tf.keras.Model):
    # constructor initializes the layers
    def __init__(self, input_size, num_actions):
        super(Net, self).__init__()
        self.dense1 = Dense(64, activation='relu', input_shape=(input_size,))
        self.dense2 = Dense(64, activation='relu')
        self.output_layer = Dense(num_actions, activation='linear')

    '''
    The forward pass of the model. It takes the state as input and returns the Q-values for each action.
    'x' is the state from the environment; it is passed through the two ReLU dense layers and then the
    output layer, which uses a linear activation so the network can predict an unbounded range of
    action-value estimates.
    '''
    def call(self, x):
        # first dense layer
        x = self.dense1(x)
        # second dense layer
        x = self.dense2(x)
        return self.output_layer(x)
class Agent:
    # this constructor initializes the environment, model, memory, and other variables required for the agent
    def __init__(self,
                 env_string,
                 num_actions,
                 state_size,
                 batch_size=32,
                 learning_rate=0.01,
                 gamma=0.98,
                 epsilon=1.0,
                 update_target_every_this_episode=3,
                 memory_capacity=10000):
        self.env_string = env_string
        self.env = gym.make(env_string)
        self.env.reset()
        self.state_size = state_size
        self.num_actions = num_actions
        # Discretize the continuous torque range [-2, 2] into num_actions evenly spaced values
        # (adjusted to match the action scaling in the dueling DQN)
        self.action_scope = [i * 4.0 / (num_actions - 1) - 2.0 for i in range(num_actions)]
        self.batch_size = batch_size
        self.gamma = gamma
        self.epsilon = epsilon
        self.update_target_every_this_episode = update_target_every_this_episode
        self.memory = Memory(memory_capacity)
        self.best_total_reward = float('-inf')
        # Initialize the models
        self.model = Net(state_size, num_actions)
        self.target_model = Net(state_size, num_actions)
        self.optimizer = Adam(learning_rate)
        self.loss_fn = tf.losses.MeanSquaredError()

    '''
    Implements the epsilon-greedy policy. With probability epsilon, a random action is selected; otherwise
    the action with the highest Q-value is chosen. The state is converted to a tensor before being passed
    to the model. (In this base model, epsilon is fixed; decay is explored in later experiments.)
    '''
    def select_action(self, state):
        # if a random number is less than epsilon, return a random action
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.num_actions), None
        # convert to tensor
        state = tf.convert_to_tensor([state], dtype=tf.float32)
        q_values = self.model(state)
        # otherwise, return the action with the highest Q-value
        return np.argmax(q_values.numpy()), None

    # Store an experience tuple in the memory buffer (see the Memory class above)
    def store_transition(self, state, action, reward, next_state):
        self.memory.update((state, action, reward, next_state, False))

    '''
    Performs a single training step on a batch of experiences sampled from the memory buffer.
    '''
    def learn(self):
        if len(self.memory.memory) < self.batch_size:
            return
        # sample a batch of transitions from the memory
        transitions = self.memory.sample(self.batch_size)
        # extract the states, actions, rewards, and next states from the batch
        state_batch, action_batch, reward_batch, next_state_batch, _ = zip(*transitions)
        # convert (s, a, r, s') to tensors
        state_batch = tf.convert_to_tensor(state_batch, dtype=tf.float32)
        action_batch = tf.convert_to_tensor(action_batch, dtype=tf.int32)
        reward_batch = tf.convert_to_tensor(reward_batch, dtype=tf.float32)
        next_state_batch = tf.convert_to_tensor(next_state_batch, dtype=tf.float32)
        # GradientTape records operations for automatic differentiation
        with tf.GradientTape() as tape:
            # Forward pass through the DQN model: Q-values for all actions given the current batch of states
            q_values = self.model(state_batch)
            # Indices of the taken actions in the flattened Q-value matrix; since q_values contains Q-values
            # for all actions, this step selects only the Q-values for the actions that were actually taken
            action_indices = tf.range(self.batch_size) * self.num_actions + action_batch
            # Flatten the Q-value matrix to a vector, then gather the Q-values for the taken actions
            predicted_q = tf.gather(tf.reshape(q_values, [-1]), action_indices)
            # Forward pass through the target model: Q-values for all actions given the next states
            next_q_values = self.target_model(next_state_batch)
            # Maximum Q-value over actions for each next state: the best achievable future reward
            max_next_q = tf.reduce_max(next_q_values, axis=1)
            # Bellman target: immediate reward plus the discounted maximum future reward
            target_q = reward_batch + self.gamma * max_next_q
            # MSE loss between the predicted and target Q-values
            loss = self.loss_fn(target_q, predicted_q)
        # Compute the gradients of the loss with respect to the model parameters and apply them
        grads = tape.gradient(loss, self.model.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.model.trainable_variables))

    # Hard-update the target model every few episodes by copying the online weights
    def update_target_model(self):
        self.target_model.set_weights(self.model.get_weights())

    # checks if the current model is the best so far based on total reward
    def is_best_model(self, total_reward):
        return total_reward > self.best_total_reward

    # records the new best total reward
    def update_best_model(self, total_reward):
        self.best_total_reward = total_reward
# Main function to run the training
def main():
    env_string = 'Pendulum-v0'
    num_actions = 40
    state_size = gym.make(env_string).observation_space.shape[0]
    agent = Agent(env_string, num_actions, state_size)
    episodes = 150
    rewards = []  # Initialize rewards list to store total rewards per episode
    for ep in range(episodes):
        state = agent.env.reset()
        done = False
        total_reward = 0
        # Capture frames for GIF
        frames = []
        while not done:
            # Render the environment and capture frames
            frames.append(agent.env.render(mode='rgb_array'))
            action_index, _ = agent.select_action(state)
            action = [agent.action_scope[action_index]]
            next_state, reward, done, _ = agent.env.step(action)
            agent.store_transition(state, action_index, reward, next_state)
            state = next_state
            total_reward += reward
            agent.learn()
        rewards.append(total_reward)  # Append the total reward of the episode to the rewards list
        # Update the target model
        if (ep + 1) % agent.update_target_every_this_episode == 0:
            agent.update_target_model()
            print(f"Target model updated at episode {ep + 1}.")
        print(f"Episode: {ep+1}, Total Reward: {total_reward:.2f}, Epsilon: {agent.epsilon:.2f}")
        if agent.is_best_model(total_reward):
            # Update and save the best model weights
            agent.update_best_model(total_reward)
            model_dir = 'Weights/DQN/Base'
            save_model_weights(agent, model_dir)
            print("Best model weights saved.")
        # Directory where you want to save the files
        save_dir = 'training_animations/DQN/Base'
        # Create the directory if it doesn't exist
        if not os.path.exists(save_dir):
            os.makedirs(save_dir)
        # Save frames as GIF
        gif_filename = os.path.join(save_dir, f'episode_{ep+1}.gif')
        imageio.mimsave(gif_filename, frames, duration=0.00333, loop=0)  # Adjust duration as needed
    # Close the environment
    agent.env.close()
    return rewards

rewards = main()
Episode: 1, Total Reward: -1020.17, Epsilon: 1.00 Best model weights saved. Episode: 2, Total Reward: -1621.36, Epsilon: 1.00 Target model updated at episode 3. Episode: 3, Total Reward: -1476.38, Epsilon: 1.00 Episode: 4, Total Reward: -867.42, Epsilon: 1.00 Best model weights saved. Episode: 5, Total Reward: -1182.32, Epsilon: 1.00 Target model updated at episode 6. Episode: 6, Total Reward: -968.25, Epsilon: 1.00 Episode: 7, Total Reward: -1332.69, Epsilon: 1.00 Episode: 8, Total Reward: -1484.43, Epsilon: 1.00 Target model updated at episode 9. Episode: 9, Total Reward: -1266.92, Epsilon: 1.00 Episode: 10, Total Reward: -1687.78, Epsilon: 1.00 Episode: 11, Total Reward: -902.20, Epsilon: 1.00 Target model updated at episode 12. Episode: 12, Total Reward: -1587.59, Epsilon: 1.00 Episode: 13, Total Reward: -860.49, Epsilon: 1.00 Best model weights saved. Episode: 14, Total Reward: -1277.07, Epsilon: 1.00 Target model updated at episode 15. Episode: 15, Total Reward: -1178.99, Epsilon: 1.00 Episode: 16, Total Reward: -1704.26, Epsilon: 1.00 Episode: 17, Total Reward: -1072.48, Epsilon: 1.00 Target model updated at episode 18. Episode: 18, Total Reward: -1169.22, Epsilon: 1.00 Episode: 19, Total Reward: -753.48, Epsilon: 1.00 Best model weights saved. Episode: 20, Total Reward: -708.47, Epsilon: 1.00 Best model weights saved. Target model updated at episode 21. Episode: 21, Total Reward: -967.37, Epsilon: 1.00 Episode: 22, Total Reward: -1174.34, Epsilon: 1.00 Episode: 23, Total Reward: -971.79, Epsilon: 1.00 Target model updated at episode 24. Episode: 24, Total Reward: -971.36, Epsilon: 1.00 Episode: 25, Total Reward: -1063.54, Epsilon: 1.00 Episode: 26, Total Reward: -1408.52, Epsilon: 1.00 Target model updated at episode 27. Episode: 27, Total Reward: -1594.92, Epsilon: 1.00 Episode: 28, Total Reward: -1317.79, Epsilon: 1.00 Episode: 29, Total Reward: -1290.44, Epsilon: 1.00 Target model updated at episode 30. 
Episode: 30, Total Reward: -1429.06, Epsilon: 1.00 Episode: 31, Total Reward: -1364.61, Epsilon: 1.00 Episode: 32, Total Reward: -1055.81, Epsilon: 1.00 Target model updated at episode 33. Episode: 33, Total Reward: -1513.24, Epsilon: 1.00 Episode: 34, Total Reward: -867.53, Epsilon: 1.00 Episode: 35, Total Reward: -1759.95, Epsilon: 1.00 Target model updated at episode 36. Episode: 36, Total Reward: -1519.45, Epsilon: 1.00 Episode: 37, Total Reward: -1289.75, Epsilon: 1.00 Episode: 38, Total Reward: -1367.02, Epsilon: 1.00 Target model updated at episode 39. Episode: 39, Total Reward: -885.22, Epsilon: 1.00 Episode: 40, Total Reward: -1248.02, Epsilon: 1.00 Episode: 41, Total Reward: -1607.37, Epsilon: 1.00 Target model updated at episode 42. Episode: 42, Total Reward: -1450.36, Epsilon: 1.00 Episode: 43, Total Reward: -1190.09, Epsilon: 1.00 Episode: 44, Total Reward: -1164.31, Epsilon: 1.00 Target model updated at episode 45. Episode: 45, Total Reward: -871.57, Epsilon: 1.00 Episode: 46, Total Reward: -1311.09, Epsilon: 1.00 Episode: 47, Total Reward: -1632.11, Epsilon: 1.00 Target model updated at episode 48. Episode: 48, Total Reward: -865.58, Epsilon: 1.00 Episode: 49, Total Reward: -1487.48, Epsilon: 1.00 Episode: 50, Total Reward: -924.35, Epsilon: 1.00 Target model updated at episode 51. Episode: 51, Total Reward: -1621.30, Epsilon: 1.00 Episode: 52, Total Reward: -1605.31, Epsilon: 1.00 Episode: 53, Total Reward: -1514.18, Epsilon: 1.00 Target model updated at episode 54. Episode: 54, Total Reward: -746.35, Epsilon: 1.00 Episode: 55, Total Reward: -816.08, Epsilon: 1.00 Episode: 56, Total Reward: -1066.85, Epsilon: 1.00 Target model updated at episode 57. Episode: 57, Total Reward: -971.25, Epsilon: 1.00 Episode: 58, Total Reward: -1143.86, Epsilon: 1.00 Episode: 59, Total Reward: -1280.52, Epsilon: 1.00 Target model updated at episode 60. 
Episode: 60, Total Reward: -1646.70, Epsilon: 1.00 Episode: 61, Total Reward: -1185.21, Epsilon: 1.00 Episode: 62, Total Reward: -1160.65, Epsilon: 1.00 Target model updated at episode 63. Episode: 63, Total Reward: -1333.64, Epsilon: 1.00 Episode: 64, Total Reward: -1365.47, Epsilon: 1.00 Episode: 65, Total Reward: -1179.39, Epsilon: 1.00 Target model updated at episode 66. Episode: 66, Total Reward: -1659.64, Epsilon: 1.00 Episode: 67, Total Reward: -1511.65, Epsilon: 1.00 Episode: 68, Total Reward: -1067.93, Epsilon: 1.00 Target model updated at episode 69. Episode: 69, Total Reward: -1260.76, Epsilon: 1.00 Episode: 70, Total Reward: -952.45, Epsilon: 1.00 Episode: 71, Total Reward: -1339.19, Epsilon: 1.00 Target model updated at episode 72. Episode: 72, Total Reward: -1344.81, Epsilon: 1.00 Episode: 73, Total Reward: -1561.77, Epsilon: 1.00 Episode: 74, Total Reward: -1180.96, Epsilon: 1.00 Target model updated at episode 75. Episode: 75, Total Reward: -957.74, Epsilon: 1.00 Episode: 76, Total Reward: -1073.90, Epsilon: 1.00 Episode: 77, Total Reward: -1066.65, Epsilon: 1.00 Target model updated at episode 78. Episode: 78, Total Reward: -859.25, Epsilon: 1.00 Episode: 79, Total Reward: -1614.02, Epsilon: 1.00 Episode: 80, Total Reward: -1703.10, Epsilon: 1.00 Target model updated at episode 81. Episode: 81, Total Reward: -1585.33, Epsilon: 1.00 Episode: 82, Total Reward: -1473.06, Epsilon: 1.00 Episode: 83, Total Reward: -1090.09, Epsilon: 1.00 Target model updated at episode 84. Episode: 84, Total Reward: -1069.16, Epsilon: 1.00 Episode: 85, Total Reward: -1518.75, Epsilon: 1.00 Episode: 86, Total Reward: -1089.77, Epsilon: 1.00 Target model updated at episode 87. Episode: 87, Total Reward: -1338.75, Epsilon: 1.00 Episode: 88, Total Reward: -1356.38, Epsilon: 1.00 Episode: 89, Total Reward: -1169.65, Epsilon: 1.00 Target model updated at episode 90. 
Episode: 90, Total Reward: -1025.54, Epsilon: 1.00 Episode: 91, Total Reward: -893.34, Epsilon: 1.00 Episode: 92, Total Reward: -975.42, Epsilon: 1.00 Target model updated at episode 93. Episode: 93, Total Reward: -1214.16, Epsilon: 1.00 Episode: 94, Total Reward: -1261.20, Epsilon: 1.00 Episode: 95, Total Reward: -1169.80, Epsilon: 1.00 Target model updated at episode 96. Episode: 96, Total Reward: -967.78, Epsilon: 1.00 Episode: 97, Total Reward: -1715.65, Epsilon: 1.00 Episode: 98, Total Reward: -1410.67, Epsilon: 1.00 Target model updated at episode 99. Episode: 99, Total Reward: -1319.11, Epsilon: 1.00 Episode: 100, Total Reward: -1182.18, Epsilon: 1.00 Episode: 101, Total Reward: -1732.26, Epsilon: 1.00 Target model updated at episode 102. Episode: 102, Total Reward: -1005.52, Epsilon: 1.00 Episode: 103, Total Reward: -1591.40, Epsilon: 1.00 Episode: 104, Total Reward: -1596.56, Epsilon: 1.00 Target model updated at episode 105. Episode: 105, Total Reward: -1660.84, Epsilon: 1.00 Episode: 106, Total Reward: -1352.63, Epsilon: 1.00 Episode: 107, Total Reward: -1707.48, Epsilon: 1.00 Target model updated at episode 108. Episode: 108, Total Reward: -1618.23, Epsilon: 1.00 Episode: 109, Total Reward: -1101.18, Epsilon: 1.00 Episode: 110, Total Reward: -1452.40, Epsilon: 1.00 Target model updated at episode 111. Episode: 111, Total Reward: -1765.38, Epsilon: 1.00 Episode: 112, Total Reward: -1104.03, Epsilon: 1.00 Episode: 113, Total Reward: -886.13, Epsilon: 1.00 Target model updated at episode 114. Episode: 114, Total Reward: -1420.91, Epsilon: 1.00 Episode: 115, Total Reward: -1196.24, Epsilon: 1.00 Episode: 116, Total Reward: -1174.38, Epsilon: 1.00 Target model updated at episode 117. Episode: 117, Total Reward: -1600.52, Epsilon: 1.00 Episode: 118, Total Reward: -1263.99, Epsilon: 1.00 Episode: 119, Total Reward: -819.07, Epsilon: 1.00 Target model updated at episode 120. 
Episode: 120, Total Reward: -1000.05, Epsilon: 1.00 Episode: 121, Total Reward: -1190.98, Epsilon: 1.00 Episode: 122, Total Reward: -892.61, Epsilon: 1.00 Target model updated at episode 123. Episode: 123, Total Reward: -899.95, Epsilon: 1.00 Episode: 124, Total Reward: -865.62, Epsilon: 1.00 Episode: 125, Total Reward: -1485.65, Epsilon: 1.00 Target model updated at episode 126. Episode: 126, Total Reward: -1758.95, Epsilon: 1.00 Episode: 127, Total Reward: -1600.77, Epsilon: 1.00 Episode: 128, Total Reward: -850.71, Epsilon: 1.00 Target model updated at episode 129. Episode: 129, Total Reward: -1195.55, Epsilon: 1.00 Episode: 130, Total Reward: -1318.62, Epsilon: 1.00 Episode: 131, Total Reward: -1701.99, Epsilon: 1.00 Target model updated at episode 132. Episode: 132, Total Reward: -1453.28, Epsilon: 1.00 Episode: 133, Total Reward: -838.32, Epsilon: 1.00 Episode: 134, Total Reward: -1347.59, Epsilon: 1.00 Target model updated at episode 135. Episode: 135, Total Reward: -812.70, Epsilon: 1.00 Episode: 136, Total Reward: -1178.16, Epsilon: 1.00 Episode: 137, Total Reward: -962.46, Epsilon: 1.00 Target model updated at episode 138. Episode: 138, Total Reward: -1465.09, Epsilon: 1.00 Episode: 139, Total Reward: -1263.67, Epsilon: 1.00 Episode: 140, Total Reward: -907.39, Epsilon: 1.00 Target model updated at episode 141. Episode: 141, Total Reward: -906.45, Epsilon: 1.00 Episode: 142, Total Reward: -1485.26, Epsilon: 1.00 Episode: 143, Total Reward: -961.16, Epsilon: 1.00 Target model updated at episode 144. Episode: 144, Total Reward: -1213.01, Epsilon: 1.00 Episode: 145, Total Reward: -1270.57, Epsilon: 1.00 Episode: 146, Total Reward: -1542.83, Epsilon: 1.00 Target model updated at episode 147. Episode: 147, Total Reward: -1159.93, Epsilon: 1.00 Episode: 148, Total Reward: -949.46, Epsilon: 1.00 Episode: 149, Total Reward: -1068.87, Epsilon: 1.00 Target model updated at episode 150. Episode: 150, Total Reward: -976.06, Epsilon: 1.00
plot_rewards(rewards)
compute_moving_average_and_plot(rewards)
GIF of highest reward attempt¶
Results¶
The DQN model is not performing well. As seen in the GIF, it is unable to balance the pendulum: when the pendulum reaches the top, the model cannot hold it at the vertical position.
The rewards plot and moving-average plot also show that training is not stable. There are violent spikes across the whole graph, indicating high volatility in the learning process. This instability is likely because the agent is exploring most of the time, since the epsilon value remains constantly at 1.00. This excessive exploration prevents the model from effectively exploiting learned strategies.
The rewards themselves are also poor. At its best, an episode reaches only around -750. In the context of the pendulum-balancing task, such a low reward indicates that the model is far from keeping the pendulum upright for extended periods. A well-performing model would achieve rewards approaching zero, the maximum attainable in this environment, which would indicate successful balancing.
Exploration vs Exploitation ¶
What changed from the previous model¶
- Implemented Epsilon-Greedy Action Selection Policy
What is Epsilon-Greedy Action Selection Policy¶
In epsilon-greedy action selection, the agent combines exploitation, which takes advantage of prior knowledge, with exploration, which looks for new options. Most of the time, the agent selects the action with the highest estimated reward, but with some probability it tries something new, occasionally contradicting what it has already learned. The aim is to strike a balance between exploration and exploitation.
What is the epsilon parameter¶
In the epsilon-greedy action selection policy, the epsilon (ε) parameter represents the probability with which the agent will choose a random action, as opposed to selecting the action that has the highest estimated reward (Highest Q Value).
When ε is high (close to 1), the agent explores more often, choosing random actions a significant portion of the time. (Discover new actions or strategies)
When ε is low (close to 0), the agent exploits its current knowledge more, selecting the action with the highest estimated reward most of the time. (Encourages exploitation of known actions that have previously led to high rewards)
How Epsilon-Greedy Action Selection Policy Works¶
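The two cases above can be captured in a short sketch. The function name `epsilon_greedy` and the use of NumPy's `default_rng` are illustrative choices, not taken from the agent code below:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action; otherwise pick the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))  # explore: uniform random action
    return int(np.argmax(q_values))              # exploit: highest estimated Q-value
```

With `epsilon = 0.0` the choice is always the greedy action; with `epsilon = 1.0` it is always random, which is exactly the behaviour of the failed run above where epsilon stayed at 1.00.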
Epsilon Decay¶
Epsilon decay is a strategy used in reinforcement learning to gradually reduce the exploration rate over time. It starts with a high exploration rate (high epsilon) to encourage the agent to explore a wide range of actions, then progressively decreases it to shift the focus towards exploiting the best-known actions as learning progresses.
Initial Value¶
Set a high initial value for epsilon, such as 1.0, which means the agent will initially explore randomly with a high probability.
After each episode¶
Epsilon is reduced over time, based on the number of episodes or training steps.
Minimum Epsilon¶
Set a lower bound or minimum value for epsilon to ensure that there is always some probability of exploration. This prevents the agent from becoming too exploitative and getting stuck in local optima.
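The three-part schedule above (high start, multiplicative decay, floor) can be sketched as follows. The constants match the agent's defaults below, but the function name `decay_schedule` is illustrative:

```python
# Multiplicative epsilon decay with a floor, mirroring the schedule described above.
EPSILON_START, EPSILON_DECAY, EPSILON_MIN = 1.0, 0.98, 0.01

def decay_schedule(steps):
    """Return the epsilon value used at each decay step."""
    eps, history = EPSILON_START, []
    for _ in range(steps):
        history.append(eps)
        eps = max(EPSILON_MIN, eps * EPSILON_DECAY)  # never drop below the floor
    return history
```

After roughly 230 decay steps with these values, epsilon reaches its floor of 0.01 and exploration never vanishes entirely.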
How Epsilon-Greedy Action Selection Policy Helps with stability¶
Helps balance discovering new strategies and leveraging known ones. Early in training, exploration encourages trying various actions to gather diverse experiences, while over time, decreasing exploration allows focusing on the best-performing strategies. This balance prevents the agent from getting stuck in suboptimal behaviors and supports stable, effective learning.
How Epsilon-Greedy Action Selection Policy Helps to improve the model¶
Exploration¶
By choosing random actions with a probability ϵ, the policy explores various states and actions, helping discover better strategies that might not be found if only the current best-known actions are chosen.
Exploitation¶
With a probability of 1−ϵ, the policy exploits the best-known action based on the current Q-values, ensuring the agent leverages its learned knowledge to maximize rewards.
Avoiding Local Minima¶
Exploration helps the model avoid getting stuck in local minima by occasionally trying less optimal actions, which can lead to discovering better long-term strategies.
References
class Memory:
    # Constructor; capacity is the maximum size of the memory. Once capacity is reached, the oldest memories are removed.
    def __init__(self, capacity):
        self.memory = deque(maxlen=capacity)

    '''
    Adds a transition to the memory. The transition is a tuple of (state, action, reward, next_state).
    If the maximum capacity of the memory is reached, the oldest memories are overwritten.
    '''
    def update(self, transition):
        self.memory.append(transition)

    '''
    Retrieves a random sample from memory; batch_size indicates the number of samples to retrieve.
    '''
    def sample(self, batch_size):
        return random.sample(self.memory, batch_size)
class Net(tf.keras.Model):
    # constructor initializes the layers
    def __init__(self, input_size, num_actions):
        super(Net, self).__init__()
        self.dense1 = Dense(64, activation='relu', input_shape=(input_size,))
        self.dense2 = Dense(64, activation='relu')
        self.output_layer = Dense(num_actions, activation='linear')

    '''
    This function is the forward pass of the model. It takes the state as input and returns the Q-values for each action.
    'x' is the input to the model, which is the state from the environment. 'x' is passed through the two dense
    layers (with 'relu' activations) and then the output layer, which uses a linear activation so the model can
    predict an unbounded range of action-value estimates.
    '''
    def call(self, x):
        # first dense layer
        x = self.dense1(x)
        # second dense layer
        x = self.dense2(x)
        return self.output_layer(x)
class Agent:
    # this constructor initializes the environment, model, memory, and other variables required for the agent
    def __init__(self,
                 env_string,
                 num_actions,
                 state_size,
                 batch_size=32,
                 learning_rate=0.01,
                 gamma=0.98,
                 epsilon=1.0,
                 epsilon_decay=0.98,
                 epsilon_min=0.01,
                 update_target_every_this_episode=3,
                 memory_capacity=10000):
        self.env_string = env_string
        self.env = gym.make(env_string)
        self.env.reset()
        self.state_size = state_size
        self.num_actions = num_actions
        # Discretize the continuous torque range [-2.0, 2.0] into num_actions evenly spaced values
        # (adjusted to match the action scaling in the dueling DQN)
        self.action_scope = [i * 4.0 / (num_actions - 1) - 2.0 for i in range(num_actions)]
        self.batch_size = batch_size
        self.gamma = gamma
        self.epsilon = epsilon
        self.epsilon_decay = epsilon_decay
        self.epsilon_min = epsilon_min
        self.update_target_every_this_episode = update_target_every_this_episode
        self.memory = Memory(memory_capacity)
        self.best_total_reward = float('-inf')
        self.model = Net(state_size, num_actions)
        self.target_model = Net(state_size, num_actions)
        self.optimizer = Adam(learning_rate)
        self.loss_fn = tf.losses.MeanSquaredError()
    '''
    Implements the epsilon-greedy policy. With probability epsilon, a random action is selected; otherwise, the action
    chosen is the one with the highest Q-value. The epsilon value is decayed over time to reduce exploration as the agent
    learns. The state is converted to a tensor before being passed through the model.
    '''
    def select_action(self, state):
        # if random number is less than epsilon, return a random action
        if np.random.rand() < self.epsilon:
            return np.random.randint(self.num_actions), None
        # convert to tensor
        state = tf.convert_to_tensor([state], dtype=tf.float32)
        q_values = self.model(state)
        # else, return the action with the highest Q-value
        return np.argmax(q_values.numpy()), None

    # Store the experience tuple in the memory buffer (see the Memory class above)
    def store_transition(self, state, action, reward, next_state):
        self.memory.update((state, action, reward, next_state, False))
    '''
    The learn function performs a single training step on a batch of experiences sampled from the memory buffer.
    '''
    def learn(self):
        if len(self.memory.memory) < self.batch_size:
            return
        # sample a batch of transitions from the memory
        transitions = self.memory.sample(self.batch_size)
        # extract the states, actions, rewards, and next states from the batch
        state_batch, action_batch, reward_batch, next_state_batch, _ = zip(*transitions)
        # convert s, a, r, s' to tensors
        state_batch = tf.convert_to_tensor(state_batch, dtype=tf.float32)
        action_batch = tf.convert_to_tensor(action_batch, dtype=tf.int32)
        reward_batch = tf.convert_to_tensor(reward_batch, dtype=tf.float32)
        next_state_batch = tf.convert_to_tensor(next_state_batch, dtype=tf.float32)
        '''
        TensorFlow's GradientTape is an API for automatic differentiation: it records operations so that gradients
        can be computed. Inside the tape, we calculate the predicted Q-values for the current states and actions.
        '''
        with tf.GradientTape() as tape:
            '''
            Forward pass through the DQN model to get the Q-values for all actions given the current batch of states.
            q_values contains the predicted Q-values for each action in each state of the batch.
            '''
            q_values = self.model(state_batch)
            '''
            'action_indices' calculates the indices of the actions taken within the flattened Q-value matrix.
            Since q_values contains Q-values for all actions, this step is necessary to select only the Q-values
            corresponding to the actions that were actually taken.
            '''
            action_indices = tf.range(self.batch_size) * self.num_actions + action_batch
            '''
            Reshapes the Q-value matrix into a single vector, then selects the Q-values for the actions taken
            using the 'action_indices' calculated above.
            '''
            predicted_q = tf.gather(tf.reshape(q_values, [-1]), action_indices)
            '''
            Forward pass through the target DQN model to get Q-values for all actions given the next states.
            '''
            next_q_values = self.target_model(next_state_batch)
            # Find the maximum Q-value among all actions for each next state: the best possible future reward achievable from that state.
            max_next_q = tf.reduce_max(next_q_values, axis=1)
            '''
            Calculates the target Q-values from the immediate reward received (reward_batch) and the discounted
            maximum future reward (self.gamma * max_next_q). This forms the update target for the Q-value of the action taken.
            '''
            target_q = reward_batch + self.gamma * max_next_q
            # Calculates the MSE loss between the predicted Q-values and the target Q-values
            loss = self.loss_fn(target_q, predicted_q)
        # Calculate the gradients of the loss with respect to the model parameters and apply them
        grads = tape.gradient(loss, self.model.trainable_variables)
        self.optimizer.apply_gradients(zip(grads, self.model.trainable_variables))
        self.update_epsilon()
    # decay the epsilon value, never dropping below the minimum
    def update_epsilon(self):
        self.epsilon = max(self.epsilon_min, self.epsilon_decay * self.epsilon)

    # hard update: copy the online model's weights into the target model
    def update_target_model(self):
        self.target_model.set_weights(self.model.get_weights())

    # checks whether the current model is the best so far, based on total reward
    def is_best_model(self, total_reward):
        return total_reward > self.best_total_reward

    # records the current total reward as the best seen so far
    def update_best_model(self, total_reward):
        self.best_total_reward = total_reward
# Main function to run the training
def main():
    env_string = 'Pendulum-v0'
    num_actions = 40
    state_size = gym.make(env_string).observation_space.shape[0]
    agent = Agent(env_string, num_actions, state_size)
    episodes = 150
    rewards = []  # Initialize rewards list to store total rewards per episode
    for ep in range(episodes):
        state = agent.env.reset()
        done = False
        total_reward = 0
        # Capture frames for GIF
        frames = []
        while not done:
            # Render the environment and capture frames
            frames.append(agent.env.render(mode='rgb_array'))
            action_index, _ = agent.select_action(state)
            action = [agent.action_scope[action_index]]
            next_state, reward, done, _ = agent.env.step(action)
            agent.store_transition(state, action_index, reward, next_state)
            state = next_state
            total_reward += reward
            agent.learn()
        rewards.append(total_reward)  # Append the total reward of the episode to the rewards list
        # Update the target model
        if (ep + 1) % agent.update_target_every_this_episode == 0:
            agent.update_target_model()
            print(f"Target model updated at episode {ep + 1}.")
        print(f"Episode: {ep+1}, Total Reward: {total_reward:.2f}, Epsilon: {agent.epsilon:.2f}")
        if agent.is_best_model(total_reward):
            # Update and save the best model weights
            agent.update_best_model(total_reward)
            model_dir = 'Weights/DQN/epsilonGreedy'
            save_model_weights(agent, model_dir)
            print("Best model weights saved.")
        # Directory where the GIFs are saved
        save_dir = 'training_animations/DQN/epsilonGreedy'
        # Create the directory if it doesn't exist
        if not os.path.exists(save_dir):
            os.makedirs(save_dir)
        # Save frames as a GIF
        gif_filename = os.path.join(save_dir, f'episode_{ep+1}.gif')
        imageio.mimsave(gif_filename, frames, duration=0.00333, loop=0)  # Adjust duration as needed
    # Close the environment
    agent.env.close()
    return rewards
rewards = main()
Episode: 1, Total Reward: -845.93, Epsilon: 0.03 Best model weights saved. Episode: 2, Total Reward: -996.02, Epsilon: 0.01 Target model updated at episode 3. Episode: 3, Total Reward: -876.12, Epsilon: 0.01 Episode: 4, Total Reward: -1620.64, Epsilon: 0.01 Episode: 5, Total Reward: -1573.13, Epsilon: 0.01 Target model updated at episode 6. Episode: 6, Total Reward: -1746.41, Epsilon: 0.01 Episode: 7, Total Reward: -1559.63, Epsilon: 0.01 Episode: 8, Total Reward: -1556.62, Epsilon: 0.01 Target model updated at episode 9. Episode: 9, Total Reward: -1264.40, Epsilon: 0.01 Episode: 10, Total Reward: -1357.14, Epsilon: 0.01 Episode: 11, Total Reward: -1729.91, Epsilon: 0.01 Target model updated at episode 12. Episode: 12, Total Reward: -1523.17, Epsilon: 0.01 Episode: 13, Total Reward: -1145.14, Epsilon: 0.01 Episode: 14, Total Reward: -1554.36, Epsilon: 0.01 Target model updated at episode 15. Episode: 15, Total Reward: -1432.93, Epsilon: 0.01 Episode: 16, Total Reward: -1101.85, Epsilon: 0.01 Episode: 17, Total Reward: -1229.44, Epsilon: 0.01 Target model updated at episode 18. Episode: 18, Total Reward: -1437.08, Epsilon: 0.01 Episode: 19, Total Reward: -1627.74, Epsilon: 0.01 Episode: 20, Total Reward: -886.84, Epsilon: 0.01 Target model updated at episode 21. Episode: 21, Total Reward: -1200.54, Epsilon: 0.01 Episode: 22, Total Reward: -1411.27, Epsilon: 0.01 Episode: 23, Total Reward: -1001.35, Epsilon: 0.01 Target model updated at episode 24. Episode: 24, Total Reward: -1194.51, Epsilon: 0.01 Episode: 25, Total Reward: -879.41, Epsilon: 0.01 Episode: 26, Total Reward: -825.19, Epsilon: 0.01 Best model weights saved. Target model updated at episode 27. Episode: 27, Total Reward: -1061.77, Epsilon: 0.01 Episode: 28, Total Reward: -263.75, Epsilon: 0.01 Best model weights saved. Episode: 29, Total Reward: -389.11, Epsilon: 0.01 Target model updated at episode 30. Episode: 30, Total Reward: -137.16, Epsilon: 0.01 Best model weights saved. 
Episode: 31, Total Reward: -778.64, Epsilon: 0.01 Episode: 32, Total Reward: -657.04, Epsilon: 0.01 Target model updated at episode 33. Episode: 33, Total Reward: -265.17, Epsilon: 0.01 Episode: 34, Total Reward: -516.10, Epsilon: 0.01 Episode: 35, Total Reward: -521.10, Epsilon: 0.01 Target model updated at episode 36. Episode: 36, Total Reward: -636.05, Epsilon: 0.01 Episode: 37, Total Reward: -393.16, Epsilon: 0.01 Episode: 38, Total Reward: -569.47, Epsilon: 0.01 Target model updated at episode 39. Episode: 39, Total Reward: -133.76, Epsilon: 0.01 Best model weights saved. Episode: 40, Total Reward: -794.32, Epsilon: 0.01 Episode: 41, Total Reward: -647.93, Epsilon: 0.01 Target model updated at episode 42. Episode: 42, Total Reward: -249.53, Epsilon: 0.01 Episode: 43, Total Reward: -629.62, Epsilon: 0.01 Episode: 44, Total Reward: -256.66, Epsilon: 0.01 Target model updated at episode 45. Episode: 45, Total Reward: -248.12, Epsilon: 0.01 Episode: 46, Total Reward: -129.03, Epsilon: 0.01 Best model weights saved. Episode: 47, Total Reward: -496.65, Epsilon: 0.01 Target model updated at episode 48. Episode: 48, Total Reward: -258.52, Epsilon: 0.01 Episode: 49, Total Reward: -525.34, Epsilon: 0.01 Episode: 50, Total Reward: -125.27, Epsilon: 0.01 Best model weights saved. Target model updated at episode 51. Episode: 51, Total Reward: -3.43, Epsilon: 0.01 Best model weights saved. Episode: 52, Total Reward: -502.62, Epsilon: 0.01 Episode: 53, Total Reward: -5.66, Epsilon: 0.01 Target model updated at episode 54. Episode: 54, Total Reward: -130.47, Epsilon: 0.01 Episode: 55, Total Reward: -235.44, Epsilon: 0.01 Episode: 56, Total Reward: -258.91, Epsilon: 0.01 Target model updated at episode 57. Episode: 57, Total Reward: -389.92, Epsilon: 0.01 Episode: 58, Total Reward: -255.52, Epsilon: 0.01 Episode: 59, Total Reward: -505.36, Epsilon: 0.01 Target model updated at episode 60. Episode: 60, Total Reward: -2.85, Epsilon: 0.01 Best model weights saved. 
Episode: 61, Total Reward: -261.82, Epsilon: 0.01 Episode: 62, Total Reward: -520.60, Epsilon: 0.01 Target model updated at episode 63. Episode: 63, Total Reward: -259.11, Epsilon: 0.01 Episode: 64, Total Reward: -632.28, Epsilon: 0.01 Episode: 65, Total Reward: -389.65, Epsilon: 0.01 Target model updated at episode 66. Episode: 66, Total Reward: -128.57, Epsilon: 0.01 Episode: 67, Total Reward: -121.78, Epsilon: 0.01 Episode: 68, Total Reward: -122.47, Epsilon: 0.01 Target model updated at episode 69. Episode: 69, Total Reward: -398.21, Epsilon: 0.01 Episode: 70, Total Reward: -129.38, Epsilon: 0.01 Episode: 71, Total Reward: -245.80, Epsilon: 0.01 Target model updated at episode 72. Episode: 72, Total Reward: -124.01, Epsilon: 0.01 Episode: 73, Total Reward: -122.64, Epsilon: 0.01 Episode: 74, Total Reward: -132.01, Epsilon: 0.01 Target model updated at episode 75. Episode: 75, Total Reward: -237.44, Epsilon: 0.01 Episode: 76, Total Reward: -126.89, Epsilon: 0.01 Episode: 77, Total Reward: -130.52, Epsilon: 0.01 Target model updated at episode 78. Episode: 78, Total Reward: -128.23, Epsilon: 0.01 Episode: 79, Total Reward: -498.87, Epsilon: 0.01 Episode: 80, Total Reward: -244.44, Epsilon: 0.01 Target model updated at episode 81. Episode: 81, Total Reward: -249.15, Epsilon: 0.01 Episode: 82, Total Reward: -129.43, Epsilon: 0.01 Episode: 83, Total Reward: -375.25, Epsilon: 0.01 Target model updated at episode 84. Episode: 84, Total Reward: -363.86, Epsilon: 0.01 Episode: 85, Total Reward: -5.67, Epsilon: 0.01 Episode: 86, Total Reward: -131.38, Epsilon: 0.01 Target model updated at episode 87. Episode: 87, Total Reward: -390.59, Epsilon: 0.01 Episode: 88, Total Reward: -130.75, Epsilon: 0.01 Episode: 89, Total Reward: -4.45, Epsilon: 0.01 Target model updated at episode 90. Episode: 90, Total Reward: -659.61, Epsilon: 0.01 Episode: 91, Total Reward: -6.95, Epsilon: 0.01 Episode: 92, Total Reward: -128.07, Epsilon: 0.01 Target model updated at episode 93. 
Episode: 93, Total Reward: -385.84, Epsilon: 0.01 Episode: 94, Total Reward: -718.24, Epsilon: 0.01 Episode: 95, Total Reward: -131.99, Epsilon: 0.01 Target model updated at episode 96. Episode: 96, Total Reward: -128.65, Epsilon: 0.01 Episode: 97, Total Reward: -241.99, Epsilon: 0.01 Episode: 98, Total Reward: -4.42, Epsilon: 0.01 Target model updated at episode 99. Episode: 99, Total Reward: -253.34, Epsilon: 0.01 Episode: 100, Total Reward: -8.97, Epsilon: 0.01 Episode: 101, Total Reward: -250.90, Epsilon: 0.01 Target model updated at episode 102. Episode: 102, Total Reward: -123.83, Epsilon: 0.01 Episode: 103, Total Reward: -4.37, Epsilon: 0.01 Episode: 104, Total Reward: -123.80, Epsilon: 0.01 Target model updated at episode 105. Episode: 105, Total Reward: -121.56, Epsilon: 0.01 Episode: 106, Total Reward: -1379.04, Epsilon: 0.01 Episode: 107, Total Reward: -3.51, Epsilon: 0.01 Target model updated at episode 108. Episode: 108, Total Reward: -230.41, Epsilon: 0.01 Episode: 109, Total Reward: -252.78, Epsilon: 0.01 Episode: 110, Total Reward: -856.83, Epsilon: 0.01 Target model updated at episode 111. Episode: 111, Total Reward: -241.72, Epsilon: 0.01 Episode: 112, Total Reward: -122.51, Epsilon: 0.01 Episode: 113, Total Reward: -4.30, Epsilon: 0.01 Target model updated at episode 114. Episode: 114, Total Reward: -404.86, Epsilon: 0.01 Episode: 115, Total Reward: -4.12, Epsilon: 0.01 Episode: 116, Total Reward: -6.18, Epsilon: 0.01 Target model updated at episode 117. Episode: 117, Total Reward: -346.91, Epsilon: 0.01 Episode: 118, Total Reward: -122.83, Epsilon: 0.01 Episode: 119, Total Reward: -447.60, Epsilon: 0.01 Target model updated at episode 120. Episode: 120, Total Reward: -363.93, Epsilon: 0.01 Episode: 121, Total Reward: -443.87, Epsilon: 0.01 Episode: 122, Total Reward: -900.90, Epsilon: 0.01 Target model updated at episode 123. 
Episode: 123, Total Reward: -122.05, Epsilon: 0.01 Episode: 124, Total Reward: -775.28, Epsilon: 0.01 Episode: 125, Total Reward: -357.36, Epsilon: 0.01 Target model updated at episode 126. Episode: 126, Total Reward: -346.54, Epsilon: 0.01 Episode: 127, Total Reward: -363.85, Epsilon: 0.01 Episode: 128, Total Reward: -1198.72, Epsilon: 0.01 Target model updated at episode 129. Episode: 129, Total Reward: -125.10, Epsilon: 0.01 Episode: 130, Total Reward: -128.25, Epsilon: 0.01 Episode: 131, Total Reward: -4.34, Epsilon: 0.01 Target model updated at episode 132. Episode: 132, Total Reward: -258.01, Epsilon: 0.01 Episode: 133, Total Reward: -400.90, Epsilon: 0.01 Episode: 134, Total Reward: -128.45, Epsilon: 0.01 Target model updated at episode 135. Episode: 135, Total Reward: -381.99, Epsilon: 0.01 Episode: 136, Total Reward: -131.44, Epsilon: 0.01 Episode: 137, Total Reward: -250.35, Epsilon: 0.01 Target model updated at episode 138. Episode: 138, Total Reward: -568.77, Epsilon: 0.01 Episode: 139, Total Reward: -248.47, Epsilon: 0.01 Episode: 140, Total Reward: -121.99, Epsilon: 0.01 Target model updated at episode 141. Episode: 141, Total Reward: -125.93, Epsilon: 0.01 Episode: 142, Total Reward: -127.95, Epsilon: 0.01 Episode: 143, Total Reward: -126.41, Epsilon: 0.01 Target model updated at episode 144. Episode: 144, Total Reward: -127.13, Epsilon: 0.01 Episode: 145, Total Reward: -240.13, Epsilon: 0.01 Episode: 146, Total Reward: -254.25, Epsilon: 0.01 Target model updated at episode 147. Episode: 147, Total Reward: -250.35, Epsilon: 0.01 Episode: 148, Total Reward: -124.31, Epsilon: 0.01 Episode: 149, Total Reward: -1061.75, Epsilon: 0.01 Target model updated at episode 150. Episode: 150, Total Reward: -127.62, Epsilon: 0.01
plot_rewards(rewards)
compute_moving_average_and_plot(rewards)
GIF of highest reward attempt¶
Results¶
The model has improved!
From the GIF, we can see that when the pendulum is in the upright position, the DQN is able to balance it and keep it steady. This suggests that the DQN has successfully learned the optimal policy for maintaining the pendulum's balance, indicating effective training and convergence of the model.
The stability of the training graphs has also improved, although it is not yet ideal. Previously, the graphs showed violent spikes with no consistent increase in rewards across episodes. After implementing the epsilon-greedy strategy, the spikes have become less severe and there is a noticeable upward trend in training performance. Around episode 20, the rewards begin to stabilize, which suggests that the model is starting to converge towards an optimal policy.
Additionally, the rewards are approaching a value close to 0, indicating that the model is effectively minimizing the error in maintaining the pendulum's upright position. This reduction in error suggests that the DQN is becoming proficient at balancing the pendulum and achieving the desired outcome.
Using Soft updates instead of hard updates ¶
What did we do¶
Instead of updating the model every few episodes, we opted for a soft update, where the target network's weights are gradually adjusted using a small update parameter.
Why do we update the target model¶
The Q-values estimated by the current model can be noisy and changing quickly due to the high variance in experiences. The target model, which is updated less frequently, provides a more stable target for the Q-values, leading to better and more reliable learning performance.
What are Hard Updates¶
In a hard update, the target model's weights are replaced entirely with the weights of the main (or policy) model at regular intervals. Essentially, every few episodes or training steps, the target model's weights are set to be exactly the same as those of the main model.
What are Soft Updates¶
In soft updates, the target model's weights are updated incrementally to be a blend of the current target model weights and the main model weights. This is done using a parameter called τ (tau), which controls the proportion of the update.
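The blend described above is θ_target ← τ·θ_online + (1−τ)·θ_target, applied to each weight array. A minimal sketch, assuming plain NumPy arrays in place of the Keras weight lists used elsewhere in this notebook (the function name `soft_update` and the τ value are illustrative):

```python
import numpy as np

def soft_update(target_weights, online_weights, tau=0.005):
    """theta_target <- tau * theta_online + (1 - tau) * theta_target, per weight array."""
    return [tau * w + (1.0 - tau) * tw
            for tw, w in zip(target_weights, online_weights)]
```

With a Keras model this would be used as `target_model.set_weights(soft_update(target_model.get_weights(), model.get_weights()))`, typically after every training step rather than every few episodes.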
Hard update VS Soft update in terms of stability¶
Hard updates involve replacing the target model's weights with those of the main model at a fixed interval of episodes or iterations. While this method is straightforward, it can introduce instability. Hard updates cause sudden and significant shifts in the target values because the entire set of weights is replaced at once. These abrupt changes can destabilize the learning process, as the target values used for computing the loss can fluctuate dramatically. Additionally, the target model might quickly adapt to recent experiences, potentially leading to overfitting on recent data and ignoring the broader experience gathered over time.
Soft updates involve gradually blending the weights of the main model with the target model using a small factor, τ (tau). This approach tends to be more stable as soft updates allow the target model to evolve more smoothly, as only a fraction of the weights are updated at each step. This gradual transition helps maintain stability in the learning process by avoiding large fluctuations in target values. By avoiding abrupt changes, soft updates reduce the risk of training instability and oscillations, leading to a more stable learning process and often faster convergence to optimal policies.
Hard update VS Soft update in terms of learning speed¶
A hard update involves periodically copying the weights from the main network to the target network. This approach is straightforward but can lead to instability, as the sudden changes in the target network can cause large fluctuations in the learning process. This instability can slow down the learning speed because the model may need more time to adjust to the abrupt changes.
A soft update gradually adjusts the target network's weights towards the main network's weights using a small update rate. This method smooths the transition between network states, leading to more stable learning. By reducing the variance in updates, soft updates can help the model converge faster and more reliably.
What we hope to see after implementing soft updates¶
Improved learning stability: Soft updates provide a smoother transition of weights from the main network to the target network, reducing sudden changes that can destabilize learning
Faster convergence: By gradually updating the target network, the model can adapt more quickly to new information, potentially leading to faster overall convergence
class Memory:
# Constructor, the capacity is the maximum size of the memory. Once capacity is reached, the old memories are removed.
def __init__(self, capacity):
self.memory = deque(maxlen=capacity)
'''
Adds a transition to the memory. The transition is a tuple of (state, action, reward, next_state, done).
If the maximum capacity of the memory is reached, the oldest memories are overwritten.
'''
def update(self, transition):
self.memory.append(transition)
'''
Retrieve a random sample from memory, batch size indicates the number of samples to be retrieved.
'''
def sample(self, batch_size):
return random.sample(self.memory, batch_size)
class Net(tf.keras.Model):
# constructor initializes the layers
def __init__(self, input_size, num_actions):
super(Net, self).__init__()
self.dense1 = Dense(64, activation='relu', input_shape=(input_size,))
self.dense2 = Dense(64, activation='relu')
self.output_layer = Dense(num_actions, activation='linear')
'''
This function is the forward pass of the model. It takes the state as input and returns the Q-values for each action.
'x' is the input to the model, which is the state from the environment. 'x' is passed through the 2 ReLU dense
layers and then the output layer to get the predicted value for each action. The output layer uses a linear
activation so it can predict an unbounded range of values, as needed to estimate action values in the RL model
'''
def call(self, x):
# first dense layer
x = self.dense1(x)
# second dense layer
x = self.dense2(x)
return self.output_layer(x)
class Agent:
# this constructor initializes the environment, model, memory, and other variables required for the agent
def __init__(self, env_string, num_actions, state_size, batch_size=32, learning_rate=0.01, gamma=0.98, epsilon=1.0, epsilon_decay=0.98, epsilon_min=0.01, tau=0.01, memory_capacity=10000):
self.env_string = env_string
self.env = gym.make(env_string)
self.env.reset()
self.state_size = state_size
self.num_actions = num_actions
self.action_scope = [i * 4.0 / (num_actions - 1) - 2.0 for i in range(num_actions)] # Adjusted to match action scaling in dueling DQN
self.batch_size = batch_size
self.gamma = gamma
self.epsilon = epsilon
self.epsilon_decay = epsilon_decay
self.epsilon_min = epsilon_min
self.tau = tau
self.memory = Memory(memory_capacity)
self.best_total_reward = float('-inf')
self.model = Net(state_size, num_actions)
self.target_model = Net(state_size, num_actions)
self.optimizer = Adam(learning_rate)
self.loss_fn = tf.losses.MeanSquaredError()
'''
Implements the epsilon-greedy policy. With probability epsilon, a random action is selected, otherwise the action chosen will
be the one with the highest Q value. The epsilon value is decayed over time to reduce the exploration as the agent learns.
It also converts the state to a tensor before passing through to the model.
'''
def select_action(self, state):
# if random number is less than epsilon, return random action
if np.random.rand() < self.epsilon:
return np.random.randint(self.num_actions), None
# convert to tensor
state = tf.convert_to_tensor([state], dtype=tf.float32)
q_values = self.model(state)
# else, return action with highest Q value
return np.argmax(q_values.numpy()), None
# Store an experience tuple in the memory buffer (uses Memory.update, defined above)
def store_transition(self, state, action, reward, next_state):
self.memory.update((state, action, reward, next_state, False))
'''
Learn function performs single step training on a batch of experiences sampled from the memory buffer.
'''
def learn(self):
if len(self.memory.memory) < self.batch_size:
return
# sample a batch of transitions from the memory
transitions = self.memory.sample(self.batch_size)
# extract the states, actions, rewards, next states, from the batch
state_batch, action_batch, reward_batch, next_state_batch, _ = zip(*transitions)
# convert the s, a, r, s_' to tensors
state_batch = tf.convert_to_tensor(state_batch, dtype=tf.float32)
action_batch = tf.convert_to_tensor(action_batch, dtype=tf.int32)
reward_batch = tf.convert_to_tensor(reward_batch, dtype=tf.float32)
next_state_batch = tf.convert_to_tensor(next_state_batch, dtype=tf.float32)
'''
TensorFlow's GradientTape is an API for automatic differentiation: it records operations so gradients can be computed later.
The block below calculates the predicted Q-values from the current states and actions.
'''
with tf.GradientTape() as tape:
'''
Forward pass through the DQN model to get the Q-values for all actions given the current batch of states.
q-values contains the predicted q-values for each action in each state of the batch
'''
q_values = self.model(state_batch)
'''
'action_indices' calculates the indices of the actions taken in the Q-value matrix.
Since q_values contains Q-values for all actions, this step is necessary to select only the Q-values corresponding to the actions that were actually taken.
'''
action_indices = tf.range(self.batch_size) * self.num_actions + action_batch
'''
Reshapes the Q-value matrix to a single vector, then selects the Q-values for the actions taken using the 'action_indices' calculated above.
'''
predicted_q = tf.gather(tf.reshape(q_values, [-1]), action_indices)
'''
Forward pass through the target DQN model to get Q-values for all actions given the next states.
'''
next_q_values = self.target_model(next_state_batch)
# Finds the maximum Q-value among all actions for each next state, which represents the best possible future reward achievable from the next state.
max_next_q = tf.reduce_max(next_q_values, axis=1)
'''
Calculates the target Q-values using the immediate reward received (reward_batch) and the
discounted maximum future reward (self.gamma * max_next_q). This forms the update target for the Q-value of the action taken.
'''
target_q = reward_batch + self.gamma * max_next_q
# Calculates the loss between the predicted Q-values and the target Q-values, basically MSE loss
loss = self.loss_fn(target_q, predicted_q)
# Calculate the gradients of the loss with respect to the model parameters
grads = tape.gradient(loss, self.model.trainable_variables)
self.optimizer.apply_gradients(zip(grads, self.model.trainable_variables))
self.update_epsilon()
# update the epsilon value
def update_epsilon(self):
self.epsilon = max(self.epsilon_min, self.epsilon_decay * self.epsilon)
# soft-update the target model weights toward the main model weights
def update_target_model(self):
main_weights = self.model.get_weights()
target_weights = self.target_model.get_weights()
for i in range(len(target_weights)):
target_weights[i] = self.tau * main_weights[i] + (1 - self.tau) * target_weights[i]
self.target_model.set_weights(target_weights)
# checks if current model is the best model based on total reward
def is_best_model(self, total_reward):
return total_reward > self.best_total_reward
# updates the best model with the current model
def update_best_model(self, total_reward):
self.best_total_reward = total_reward
# Main function to run the training
def main():
env_string = 'Pendulum-v0'
num_actions = 40
state_size = gym.make(env_string).observation_space.shape[0]
agent = Agent(env_string, num_actions, state_size)
episodes = 150
rewards = [] # Initialize rewards list to store total rewards per episode
for ep in range(episodes):
state = agent.env.reset()
done = False
total_reward = 0
# Capture frames for GIF
frames = []
while not done:
# Render the environment and capture frames
frames.append(agent.env.render(mode='rgb_array'))
action_index, _ = agent.select_action(state)
action = [agent.action_scope[action_index]]
next_state, reward, done, _ = agent.env.step(action)
agent.store_transition(state, action_index, reward, next_state)
state = next_state
total_reward += reward
agent.learn()
agent.update_target_model()
rewards.append(total_reward) # Append the total reward of the episode to the rewards list
print(f"Episode: {ep+1}, Total Reward: {total_reward:.2f}, Epsilon: {agent.epsilon:.2f}")
if agent.is_best_model(total_reward):
# Update and save the best model weights
agent.update_best_model(total_reward)
model_dir = 'Weights/DQN/softUpdate'
save_model_weights(agent, model_dir)
print("Best model weights saved.")
# Directory where you want to save the files
save_dir = 'training_animations/DQN/softUpdate'
# Create the directory if it doesn't exist
if not os.path.exists(save_dir):
os.makedirs(save_dir)
# Save frames as GIF
gif_filename = os.path.join(save_dir, f'episode_{ep+1}.gif')
imageio.mimsave(gif_filename, frames, duration=0.00333, loop=0) # Adjust duration as needed
# Close the environment
agent.env.close()
return rewards
rewards = main()
Episode: 1, Total Reward: -883.04, Epsilon: 0.03 Best model weights saved. Episode: 2, Total Reward: -1074.48, Epsilon: 0.01 Episode: 3, Total Reward: -1404.75, Epsilon: 0.01 Episode: 4, Total Reward: -1313.33, Epsilon: 0.01 Episode: 5, Total Reward: -861.22, Epsilon: 0.01 Best model weights saved. Episode: 6, Total Reward: -1472.56, Epsilon: 0.01 Episode: 7, Total Reward: -125.73, Epsilon: 0.01 Best model weights saved. Episode: 8, Total Reward: -377.56, Epsilon: 0.01 Episode: 9, Total Reward: -1311.26, Epsilon: 0.01 Episode: 10, Total Reward: -1190.65, Epsilon: 0.01 Episode: 11, Total Reward: -122.91, Epsilon: 0.01 Best model weights saved. Episode: 12, Total Reward: -127.02, Epsilon: 0.01 Episode: 13, Total Reward: -378.43, Epsilon: 0.01 Episode: 14, Total Reward: -1.22, Epsilon: 0.01 Best model weights saved. Episode: 15, Total Reward: -120.34, Epsilon: 0.01 Episode: 16, Total Reward: -935.97, Epsilon: 0.01 Episode: 17, Total Reward: -489.86, Epsilon: 0.01 Episode: 18, Total Reward: -252.22, Epsilon: 0.01 Episode: 19, Total Reward: -619.12, Epsilon: 0.01 Episode: 20, Total Reward: -121.08, Epsilon: 0.01 Episode: 21, Total Reward: -6.03, Epsilon: 0.01 Episode: 22, Total Reward: -125.35, Epsilon: 0.01 Episode: 23, Total Reward: -6.65, Epsilon: 0.01 Episode: 24, Total Reward: -364.99, Epsilon: 0.01 Episode: 25, Total Reward: -244.74, Epsilon: 0.01 Episode: 26, Total Reward: -687.64, Epsilon: 0.01 Episode: 27, Total Reward: -416.33, Epsilon: 0.01 Episode: 28, Total Reward: -126.29, Epsilon: 0.01 Episode: 29, Total Reward: -1.84, Epsilon: 0.01 Episode: 30, Total Reward: -123.19, Epsilon: 0.01 Episode: 31, Total Reward: -127.87, Epsilon: 0.01 Episode: 32, Total Reward: -388.73, Epsilon: 0.01 Episode: 33, Total Reward: -643.90, Epsilon: 0.01 Episode: 34, Total Reward: -674.63, Epsilon: 0.01 Episode: 35, Total Reward: -122.16, Epsilon: 0.01 Episode: 36, Total Reward: -128.56, Epsilon: 0.01 Episode: 37, Total Reward: -245.06, Epsilon: 0.01 Episode: 38, Total Reward: 
-128.09, Epsilon: 0.01 Episode: 39, Total Reward: -2.26, Epsilon: 0.01 Episode: 40, Total Reward: -365.53, Epsilon: 0.01 Episode: 41, Total Reward: -864.91, Epsilon: 0.01 Episode: 42, Total Reward: -2.92, Epsilon: 0.01 Episode: 43, Total Reward: -122.16, Epsilon: 0.01 Episode: 44, Total Reward: -124.38, Epsilon: 0.01 Episode: 45, Total Reward: -243.11, Epsilon: 0.01 Episode: 46, Total Reward: -245.05, Epsilon: 0.01 Episode: 47, Total Reward: -239.45, Epsilon: 0.01 Episode: 48, Total Reward: -391.27, Epsilon: 0.01 Episode: 49, Total Reward: -126.30, Epsilon: 0.01 Episode: 50, Total Reward: -474.51, Epsilon: 0.01 Episode: 51, Total Reward: -244.26, Epsilon: 0.01 Episode: 52, Total Reward: -358.45, Epsilon: 0.01 Episode: 53, Total Reward: -129.23, Epsilon: 0.01 Episode: 54, Total Reward: -352.58, Epsilon: 0.01 Episode: 55, Total Reward: -125.68, Epsilon: 0.01 Episode: 56, Total Reward: -126.21, Epsilon: 0.01 Episode: 57, Total Reward: -6.25, Epsilon: 0.01 Episode: 58, Total Reward: -236.86, Epsilon: 0.01 Episode: 59, Total Reward: -697.38, Epsilon: 0.01 Episode: 60, Total Reward: -238.51, Epsilon: 0.01 Episode: 61, Total Reward: -252.27, Epsilon: 0.01 Episode: 62, Total Reward: -402.05, Epsilon: 0.01 Episode: 63, Total Reward: -370.59, Epsilon: 0.01 Episode: 64, Total Reward: -128.53, Epsilon: 0.01 Episode: 65, Total Reward: -127.61, Epsilon: 0.01 Episode: 66, Total Reward: -369.93, Epsilon: 0.01 Episode: 67, Total Reward: -132.09, Epsilon: 0.01 Episode: 68, Total Reward: -125.48, Epsilon: 0.01 Episode: 69, Total Reward: -808.86, Epsilon: 0.01 Episode: 70, Total Reward: -5.24, Epsilon: 0.01 Episode: 71, Total Reward: -657.30, Epsilon: 0.01 Episode: 72, Total Reward: -6.21, Epsilon: 0.01 Episode: 73, Total Reward: -368.28, Epsilon: 0.01 Episode: 74, Total Reward: -130.37, Epsilon: 0.01 Episode: 75, Total Reward: -130.59, Epsilon: 0.01 Episode: 76, Total Reward: -132.76, Epsilon: 0.01 Episode: 77, Total Reward: -351.19, Epsilon: 0.01 Episode: 78, Total Reward: -241.09, 
Epsilon: 0.01 Episode: 79, Total Reward: -132.70, Epsilon: 0.01 Episode: 80, Total Reward: -632.73, Epsilon: 0.01 Episode: 81, Total Reward: -7.28, Epsilon: 0.01 Episode: 82, Total Reward: -502.58, Epsilon: 0.01 Episode: 83, Total Reward: -138.08, Epsilon: 0.01 Episode: 84, Total Reward: -510.42, Epsilon: 0.01 Episode: 85, Total Reward: -257.95, Epsilon: 0.01 Episode: 86, Total Reward: -384.04, Epsilon: 0.01 Episode: 87, Total Reward: -133.00, Epsilon: 0.01 Episode: 88, Total Reward: -259.84, Epsilon: 0.01 Episode: 89, Total Reward: -731.03, Epsilon: 0.01 Episode: 90, Total Reward: -128.09, Epsilon: 0.01 Episode: 91, Total Reward: -127.01, Epsilon: 0.01 Episode: 92, Total Reward: -225.76, Epsilon: 0.01 Episode: 93, Total Reward: -251.66, Epsilon: 0.01 Episode: 94, Total Reward: -135.51, Epsilon: 0.01 Episode: 95, Total Reward: -8.34, Epsilon: 0.01 Episode: 96, Total Reward: -448.66, Epsilon: 0.01 Episode: 97, Total Reward: -505.24, Epsilon: 0.01 Episode: 98, Total Reward: -504.03, Epsilon: 0.01 Episode: 99, Total Reward: -129.89, Epsilon: 0.01 Episode: 100, Total Reward: -1319.27, Epsilon: 0.01 Episode: 101, Total Reward: -128.13, Epsilon: 0.01 Episode: 102, Total Reward: -258.08, Epsilon: 0.01 Episode: 103, Total Reward: -120.20, Epsilon: 0.01 Episode: 104, Total Reward: -124.53, Epsilon: 0.01 Episode: 105, Total Reward: -1179.15, Epsilon: 0.01 Episode: 106, Total Reward: -8.94, Epsilon: 0.01 Episode: 107, Total Reward: -498.72, Epsilon: 0.01 Episode: 108, Total Reward: -385.21, Epsilon: 0.01 Episode: 109, Total Reward: -122.40, Epsilon: 0.01 Episode: 110, Total Reward: -123.21, Epsilon: 0.01 Episode: 111, Total Reward: -250.80, Epsilon: 0.01 Episode: 112, Total Reward: -758.79, Epsilon: 0.01 Episode: 113, Total Reward: -131.66, Epsilon: 0.01 Episode: 114, Total Reward: -128.69, Epsilon: 0.01 Episode: 115, Total Reward: -131.27, Epsilon: 0.01 Episode: 116, Total Reward: -248.97, Epsilon: 0.01 Episode: 117, Total Reward: -903.34, Epsilon: 0.01 Episode: 118, Total 
Reward: -134.86, Epsilon: 0.01 Episode: 119, Total Reward: -125.48, Epsilon: 0.01 Episode: 120, Total Reward: -124.59, Epsilon: 0.01 Episode: 121, Total Reward: -237.94, Epsilon: 0.01 Episode: 122, Total Reward: -363.07, Epsilon: 0.01 Episode: 123, Total Reward: -371.58, Epsilon: 0.01 Episode: 124, Total Reward: -133.03, Epsilon: 0.01 Episode: 125, Total Reward: -250.87, Epsilon: 0.01 Episode: 126, Total Reward: -401.72, Epsilon: 0.01 Episode: 127, Total Reward: -382.18, Epsilon: 0.01 Episode: 128, Total Reward: -387.46, Epsilon: 0.01 Episode: 129, Total Reward: -493.44, Epsilon: 0.01 Episode: 130, Total Reward: -254.51, Epsilon: 0.01 Episode: 131, Total Reward: -382.79, Epsilon: 0.01 Episode: 132, Total Reward: -258.17, Epsilon: 0.01 Episode: 133, Total Reward: -699.46, Epsilon: 0.01 Episode: 134, Total Reward: -131.28, Epsilon: 0.01 Episode: 135, Total Reward: -131.94, Epsilon: 0.01 Episode: 136, Total Reward: -14.57, Epsilon: 0.01 Episode: 137, Total Reward: -361.99, Epsilon: 0.01 Episode: 138, Total Reward: -832.75, Epsilon: 0.01 Episode: 139, Total Reward: -375.00, Epsilon: 0.01 Episode: 140, Total Reward: -247.13, Epsilon: 0.01 Episode: 141, Total Reward: -730.13, Epsilon: 0.01 Episode: 142, Total Reward: -734.50, Epsilon: 0.01 Episode: 143, Total Reward: -264.97, Epsilon: 0.01 Episode: 144, Total Reward: -127.14, Epsilon: 0.01 Episode: 145, Total Reward: -422.78, Epsilon: 0.01 Episode: 146, Total Reward: -122.52, Epsilon: 0.01 Episode: 147, Total Reward: -356.22, Epsilon: 0.01 Episode: 148, Total Reward: -137.87, Epsilon: 0.01 Episode: 149, Total Reward: -259.25, Epsilon: 0.01 Episode: 150, Total Reward: -127.97, Epsilon: 0.01
plot_rewards(rewards)
compute_moving_average_and_plot(rewards)
GIF of highest reward attempt¶
Results¶
The model has improved!
Similar to the previous model, the DQN model with soft updates is able to balance the pendulum. As seen in the GIF, once the pendulum reaches the upright position, the DQN keeps it steady. This suggests that the DQN has successfully learned an effective policy for maintaining the pendulum's balance, indicating effective training and convergence of the model.
We observe that the model converges more quickly, achieving near-zero rewards within approximately 20 episodes. Additionally, the model demonstrates greater stability during the initial 20 episodes, as indicated by the moving average curve, which shows fewer severe spikes compared to the previous model.
However, after the initial 20 episodes, once the model reaches optimal rewards, it begins to fluctuate. This fluctuation suggests that while the model has found a good policy, it may still be exploring or experiencing minor instabilities. These fluctuations could indicate that the model is occasionally exploring alternative strategies due to the epsilon-greedy policy.
Decreasing the Number of Episodes¶
What did we do¶
- Number of episodes decreased from 150 to 25
Why did we decrease the number of episodes¶
The previous model showed improvement in terms of convergence speed and stability within the first 20 episodes. However, beyond this point, the rewards fluctuated within a narrow range of -500 to -100, indicating that further training did not lead to significant improvements. To address this, we decided to decrease the number of episodes to around 25. This decision was made to ensure the model focuses on achieving stability and optimal performance.
By limiting the number of episodes, we aim to prevent overfitting and unnecessary exploration, both of which can lead to instability and fluctuations in performance.
class Memory:
# Constructor, the capacity is the maximum size of the memory. Once capacity is reached, the old memories are removed.
def __init__(self, capacity):
self.memory = deque(maxlen=capacity)
'''
Adds a transition to the memory. The transition is a tuple of (state, action, reward, next_state, done).
If the maximum capacity of the memory is reached, the oldest memories are overwritten.
'''
def update(self, transition):
self.memory.append(transition)
'''
Retrieve a random sample from memory, batch size indicates the number of samples to be retrieved.
'''
def sample(self, batch_size):
return random.sample(self.memory, batch_size)
class Net(tf.keras.Model):
# constructor initializes the layers
def __init__(self, input_size, num_actions):
super(Net, self).__init__()
self.dense1 = Dense(64, activation='relu', input_shape=(input_size,))
self.dense2 = Dense(64, activation='relu')
self.output_layer = Dense(num_actions, activation='linear')
'''
This function is the forward pass of the model. It takes the state as input and returns the Q-values for each action.
'x' is the input to the model, which is the state from the environment. 'x' is passed through the 2 ReLU dense
layers and then the output layer to get the predicted value for each action. The output layer uses a linear
activation so it can predict an unbounded range of values, as needed to estimate action values in the RL model
'''
def call(self, x):
# first dense layer
x = self.dense1(x)
# second dense layer
x = self.dense2(x)
return self.output_layer(x)
class Agent:
# this constructor initializes the environment, model, memory, and other variables required for the agent
def __init__(self, env_string, num_actions, state_size, batch_size=32, learning_rate=0.01, gamma=0.98, epsilon=1.0, epsilon_decay=0.98, epsilon_min=0.01, tau=0.01, memory_capacity=10000):
self.env_string = env_string
self.env = gym.make(env_string)
self.env.reset()
self.state_size = state_size
self.num_actions = num_actions
self.action_scope = [i * 4.0 / (num_actions - 1) - 2.0 for i in range(num_actions)] # Adjusted to match action scaling in dueling DQN
self.batch_size = batch_size
self.gamma = gamma
self.epsilon = epsilon
self.epsilon_decay = epsilon_decay
self.epsilon_min = epsilon_min
self.tau = tau
self.memory = Memory(memory_capacity)
self.best_total_reward = float('-inf')
self.model = Net(state_size, num_actions)
self.target_model = Net(state_size, num_actions)
self.optimizer = Adam(learning_rate)
self.loss_fn = tf.losses.MeanSquaredError()
'''
Implements the epsilon-greedy policy. With probability epsilon, a random action is selected, otherwise the action chosen will
be the one with the highest Q value. The epsilon value is decayed over time to reduce the exploration as the agent learns.
It also converts the state to a tensor before passing through to the model.
'''
def select_action(self, state):
# if random number is less than epsilon, return random action
if np.random.rand() < self.epsilon:
return np.random.randint(self.num_actions), None
# convert to tensor
state = tf.convert_to_tensor([state], dtype=tf.float32)
q_values = self.model(state)
# else, return action with highest Q value
return np.argmax(q_values.numpy()), None
# Store an experience tuple in the memory buffer (uses Memory.update, defined above)
def store_transition(self, state, action, reward, next_state):
self.memory.update((state, action, reward, next_state, False))
'''
Learn function performs single step training on a batch of experiences sampled from the memory buffer.
'''
def learn(self):
if len(self.memory.memory) < self.batch_size:
return
# sample a batch of transitions from the memory
transitions = self.memory.sample(self.batch_size)
# extract the states, actions, rewards, next states, from the batch
state_batch, action_batch, reward_batch, next_state_batch, _ = zip(*transitions)
# convert the s, a, r, s_' to tensors
state_batch = tf.convert_to_tensor(state_batch, dtype=tf.float32)
action_batch = tf.convert_to_tensor(action_batch, dtype=tf.int32)
reward_batch = tf.convert_to_tensor(reward_batch, dtype=tf.float32)
next_state_batch = tf.convert_to_tensor(next_state_batch, dtype=tf.float32)
'''
TensorFlow's GradientTape is an API for automatic differentiation: it records operations so gradients can be computed later.
The block below calculates the predicted Q-values from the current states and actions.
'''
with tf.GradientTape() as tape:
'''
Forward pass through the DQN model to get the Q-values for all actions given the current batch of states.
q-values contains the predicted q-values for each action in each state of the batch
'''
q_values = self.model(state_batch)
'''
'action_indices' calculates the indices of the actions taken in the Q-value matrix.
Since q_values contains Q-values for all actions, this step is necessary to select only the Q-values corresponding to the actions that were actually taken.
'''
action_indices = tf.range(self.batch_size) * self.num_actions + action_batch
'''
Reshapes the Q-value matrix to a single vector, then selects the Q-values for the actions taken using the 'action_indices' calculated above.
'''
predicted_q = tf.gather(tf.reshape(q_values, [-1]), action_indices)
'''
Forward pass through the target DQN model to get Q-values for all actions given the next states.
'''
next_q_values = self.target_model(next_state_batch)
# Finds the maximum Q-value among all actions for each next state, which represents the best possible future reward achievable from the next state.
max_next_q = tf.reduce_max(next_q_values, axis=1)
'''
Calculates the target Q-values using the immediate reward received (reward_batch) and the
discounted maximum future reward (self.gamma * max_next_q). This forms the update target for the Q-value of the action taken.
'''
target_q = reward_batch + self.gamma * max_next_q
# Calculates the loss between the predicted Q-values and the target Q-values, basically MSE loss
loss = self.loss_fn(target_q, predicted_q)
# Calculate the gradients of the loss with respect to the model parameters
grads = tape.gradient(loss, self.model.trainable_variables)
self.optimizer.apply_gradients(zip(grads, self.model.trainable_variables))
self.update_epsilon()
# update the epsilon value
def update_epsilon(self):
self.epsilon = max(self.epsilon_min, self.epsilon_decay * self.epsilon)
# soft-update the target model weights toward the main model weights
def update_target_model(self):
main_weights = self.model.get_weights()
target_weights = self.target_model.get_weights()
for i in range(len(target_weights)):
target_weights[i] = self.tau * main_weights[i] + (1 - self.tau) * target_weights[i]
self.target_model.set_weights(target_weights)
# checks if current model is the best model based on total reward
def is_best_model(self, total_reward):
return total_reward > self.best_total_reward
# updates the best model with the current model
def update_best_model(self, total_reward):
self.best_total_reward = total_reward
# Main function to run the training
def main():
env_string = 'Pendulum-v0'
num_actions = 40
state_size = gym.make(env_string).observation_space.shape[0]
agent = Agent(env_string, num_actions, state_size)
episodes = 25
rewards = [] # Initialize rewards list to store total rewards per episode
for ep in range(episodes):
state = agent.env.reset()
done = False
total_reward = 0
# Capture frames for GIF
frames = []
while not done:
# Render the environment and capture frames
frames.append(agent.env.render(mode='rgb_array'))
action_index, _ = agent.select_action(state)
action = [agent.action_scope[action_index]]
next_state, reward, done, _ = agent.env.step(action)
agent.store_transition(state, action_index, reward, next_state)
state = next_state
total_reward += reward
agent.learn()
agent.update_target_model()
rewards.append(total_reward) # Append the total reward of the episode to the rewards list
print(f"Episode: {ep+1}, Total Reward: {total_reward:.2f}, Epsilon: {agent.epsilon:.2f}")
if agent.is_best_model(total_reward):
# Update and save the best model weights
agent.update_best_model(total_reward)
model_dir = 'Weights/DQN/decreaseEpisodes'
save_model_weights(agent, model_dir)
print("Best model weights saved.")
# Directory where you want to save the files
save_dir = 'training_animations/DQN/decreaseEpisodes'
# Create the directory if it doesn't exist
if not os.path.exists(save_dir):
os.makedirs(save_dir)
# Save frames as GIF
gif_filename = os.path.join(save_dir, f'episode_{ep+1}.gif')
imageio.mimsave(gif_filename, frames, duration=0.00333, loop=0) # Adjust duration as needed
# Close the environment
agent.env.close()
return rewards
rewards = main()
Episode: 1, Total Reward: -1136.59, Epsilon: 0.03 Best model weights saved. Episode: 2, Total Reward: -1595.20, Epsilon: 0.01 Episode: 3, Total Reward: -1697.06, Epsilon: 0.01 Episode: 4, Total Reward: -1356.38, Epsilon: 0.01 Episode: 5, Total Reward: -1379.86, Epsilon: 0.01 Episode: 6, Total Reward: -1160.54, Epsilon: 0.01 Episode: 7, Total Reward: -1424.49, Epsilon: 0.01 Episode: 8, Total Reward: -1018.19, Epsilon: 0.01 Best model weights saved. Episode: 9, Total Reward: -1282.38, Epsilon: 0.01 Episode: 10, Total Reward: -758.31, Epsilon: 0.01 Best model weights saved. Episode: 11, Total Reward: -1155.38, Epsilon: 0.01 Episode: 12, Total Reward: -505.12, Epsilon: 0.01 Best model weights saved. Episode: 13, Total Reward: -638.45, Epsilon: 0.01 Episode: 14, Total Reward: -320.07, Epsilon: 0.01 Best model weights saved. Episode: 15, Total Reward: -605.00, Epsilon: 0.01 Episode: 16, Total Reward: -2.22, Epsilon: 0.01 Best model weights saved. Episode: 17, Total Reward: -361.82, Epsilon: 0.01 Episode: 18, Total Reward: -356.10, Epsilon: 0.01 Episode: 19, Total Reward: -124.12, Epsilon: 0.01 Episode: 20, Total Reward: -377.44, Epsilon: 0.01 Episode: 21, Total Reward: -124.33, Epsilon: 0.01 Episode: 22, Total Reward: -122.28, Epsilon: 0.01 Episode: 23, Total Reward: -240.99, Epsilon: 0.01 Episode: 24, Total Reward: -497.55, Epsilon: 0.01 Episode: 25, Total Reward: -456.41, Epsilon: 0.01
plot_rewards(rewards)
compute_moving_average_and_plot(rewards)
GIF of highest reward attempt¶
Results¶
Similar to the previous models, the DQN model, even with a reduced number of training episodes, is able to balance the pendulum effectively. As seen from the GIF, when the pendulum is in the upright position, the DQN can maintain its balance and keep it steady. This suggests that the DQN has successfully learned the optimal policy for maintaining the pendulum's balance, indicating effective training and convergence of the model.
In terms of stability, the model is improving steadily, as indicated by the clear upward trend. However, there are still occasional decreases in rewards between episodes, likely due to the model exploring different strategies. Nonetheless, these decreases are not very drastic.
Finding the Optimal Value for Tau¶
What is Tau¶
Previously, we used a soft update instead of a hard update for updating the target model.
The parameter tau plays a crucial role in soft updates. It determines the rate at which the target model is updated with the weights of the main model. Specifically, tau is a small positive value (typically much less than 1) that controls the extent to which the target model's weights are adjusted towards the main model's weights.
How Tau affects the stability of the model¶
A smaller tau value results in more gradual updates to the target model. This slow adjustment ensures that the target model doesn't drastically change its estimates of Q-values. As a result, the learning process becomes more stable because the agent's target Q-values change smoothly over time.
Large or abrupt updates to the target model (high tau) can cause oscillations in the learning process. These oscillations can make it difficult for the agent to converge to an optimal policy, as the Q-values might swing between overestimation and underestimation.
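To make this concrete, a toy calculation (a scalar target weight starting at 0.0 blending toward a fixed main weight of 1.0; `target_after` is an illustrative helper, not part of our agent) shows how tau controls how quickly the target tracks the main network:

```python
def target_after(n_updates, tau):
    """Scalar target weight after n soft updates toward a fixed main weight
    of 1.0, starting from 0.0: equals 1 - (1 - tau)**n."""
    return 1.0 - (1.0 - tau) ** n_updates

# Larger tau closes the gap to the main network much faster.
progress = {tau: round(target_after(100, tau), 3) for tau in (0.005, 0.01, 0.05)}
```

After 100 updates, tau = 0.05 has moved the target almost all the way to the main weight, while tau = 0.005 has covered well under half the gap, which is why small tau values smooth out (but also delay) changes in the target values.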
What will we be doing¶
Previous value of tau: 0.01
We will be looping through different tau values to evaluate their impact on the stability and performance of our model
Values of tau that we will be trying: [0.005, 0.01, 0.05]
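The sweep itself can be sketched as a simple loop (a hedged sketch: `sweep_tau` and `train_fn` are hypothetical names; the real `train_fn` would construct an `Agent` with the given tau and run the training loop, here replaced by a dummy lambda):

```python
def sweep_tau(tau_values, train_fn):
    """Run one training session per tau value and return the mean
    episode reward for each. `train_fn(tau)` stands in for a full
    training run and must return a list of per-episode rewards."""
    results = {}
    for tau in tau_values:
        rewards = train_fn(tau)
        results[tau] = sum(rewards) / len(rewards)
    return results

# Usage with a dummy train_fn; the real one would train Agent(..., tau=tau).
scores = sweep_tau([0.005, 0.01, 0.05],
                   lambda tau: [-500.0, -250.0 * tau / 0.01])
best_tau = max(scores, key=scores.get)
```

The tau with the highest mean reward (and, by inspection of the reward curves, the most stable training) is then carried forward.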
class Memory:
# Constructor, the capacity is the maximum size of the memory. Once capacity is reached, the old memories are removed.
def __init__(self, capacity):
self.memory = deque(maxlen=capacity)
'''
Adds a transition to the memory. The transition is a tuple of (state, action, reward, next_state, done).
If the maximum capacity of the memory is reached, the oldest memories are overwritten.
'''
def update(self, transition):
self.memory.append(transition)
'''
Retrieve a random sample from memory, batch size indicates the number of samples to be retrieved.
'''
def sample(self, batch_size):
return random.sample(self.memory, batch_size)
class Net(tf.keras.Model):
# constructor initializes the layers
def __init__(self, input_size, num_actions):
super(Net, self).__init__()
self.dense1 = Dense(64, activation='relu', input_shape=(input_size,))
self.dense2 = Dense(64, activation='relu')
self.output_layer = Dense(num_actions, activation='linear')
'''
This function is the forward pass of the model. It takes the state as input and returns the Q values for each action.
'x' is the input to the model, which is the state from the environment. 'x' is then passed through the 2 dense
layers and the output layer to get the predicted values for each action. The hidden layers use the 'relu' activation,
while the output layer uses a linear activation so it can predict an unbounded range of action values in the RL model
'''
def call(self, x):
# first dense layer
x = self.dense1(x)
# second dense layer
x = self.dense2(x)
return self.output_layer(x)
class Agent:
# this constructor initializes the environment, model, memory, and other variables required for the agent
def __init__(self, env_string, num_actions, state_size, batch_size=32, learning_rate=0.01, gamma=0.98, epsilon=1.0, epsilon_decay=0.98, epsilon_min=0.01, tau=0.01, memory_capacity=10000):
self.env_string = env_string
self.env = gym.make(env_string)
self.env.reset()
self.state_size = state_size
self.num_actions = num_actions
self.action_scope = [i * 4.0 / (num_actions - 1) - 2.0 for i in range(num_actions)] # Adjusted to match action scaling in dueling DQN
self.batch_size = batch_size
self.gamma = gamma
self.epsilon = epsilon
self.epsilon_decay = epsilon_decay
self.epsilon_min = epsilon_min
self.tau = tau
self.memory = Memory(memory_capacity)
self.best_total_reward = float('-inf')
self.model = Net(state_size, num_actions)
self.target_model = Net(state_size, num_actions)
self.optimizer = Adam(learning_rate)
self.loss_fn = tf.losses.MeanSquaredError()
'''
Implements the epsilon-greedy policy. With probability epsilon, a random action is selected, otherwise the action chosen will
be the one with the highest Q value. The epsilon value is decayed over time to reduce the exploration as the agent learns.
It also converts the state to a tensor before passing through to the model.
'''
def select_action(self, state):
# if random number is less than epsilon, return random action
if np.random.rand() < self.epsilon:
return np.random.randint(self.num_actions), None
# convert to tensor
state = tf.convert_to_tensor([state], dtype=tf.float32)
q_values = self.model(state)
# else, return action with highest Q value
return np.argmax(q_values.numpy()), None
# Store the experience tuple in the memory buffer via the Memory.update function defined above
def store_transition(self, state, action, reward, next_state):
self.memory.update((state, action, reward, next_state, False))
'''
Learn function performs single step training on a batch of experiences sampled from the memory buffer.
'''
def learn(self):
if len(self.memory.memory) < self.batch_size:
return
# sample a batch of transitions from the memory
transitions = self.memory.sample(self.batch_size)
# extract the states, actions, rewards, next states, from the batch
state_batch, action_batch, reward_batch, next_state_batch, _ = zip(*transitions)
# convert the s, a, r, s_' to tensors
state_batch = tf.convert_to_tensor(state_batch, dtype=tf.float32)
action_batch = tf.convert_to_tensor(action_batch, dtype=tf.int32)
reward_batch = tf.convert_to_tensor(reward_batch, dtype=tf.float32)
next_state_batch = tf.convert_to_tensor(next_state_batch, dtype=tf.float32)
'''
TensorFlow's GradientTape is an API for automatic differentiation: it records the operations executed inside its context so gradients can be computed.
The block below calculates the predicted Q-values from the current states and actions, the target Q-values, and the loss between them.
'''
with tf.GradientTape() as tape:
'''
Forward pass through the DQN model to get the Q-values for all actions given the current batch of states.
q-values contains the predicted q-values for each action in each state of the batch
'''
q_values = self.model(state_batch)
'''
'action_indices' calculates the indices of the actions taken in the Q-value matrix.
Since q_values contains Q-values for all actions, this step is necessary to select only the Q-values corresponding to the actions that were actually taken.
'''
action_indices = tf.range(self.batch_size) * self.num_actions + action_batch
'''
Reshapes the Q-value matrix to a single vector, then selects the Q-values for the actions taken using the 'action_indices' calculated above.
'''
predicted_q = tf.gather(tf.reshape(q_values, [-1]), action_indices)
'''
Forward pass through the target DQN model to get Q-values for all actions given the next states.
'''
next_q_values = self.target_model(next_state_batch)
# Finds the maximum Q-value among all actions for each next state, which represents the best possible future reward achievable from the next state.
max_next_q = tf.reduce_max(next_q_values, axis=1)
'''
Calculates the target Q-values using the immediate reward received (reward_batch) and the
discounted maximum future reward (self.gamma * max_next_q). This forms the update target for the Q-value of the action taken.
'''
target_q = reward_batch + self.gamma * max_next_q
# Calculates the loss between the predicted Q-values and the target Q-values, basically MSE loss
loss = self.loss_fn(target_q, predicted_q)
# Calculate the gradients of the loss with respect to the model parameters
grads = tape.gradient(loss, self.model.trainable_variables)
self.optimizer.apply_gradients(zip(grads, self.model.trainable_variables))
self.update_epsilon()
# update the epsilon value
def update_epsilon(self):
self.epsilon = max(self.epsilon_min, self.epsilon_decay * self.epsilon)
# soft-update the target model towards the main model
def update_target_model(self):
main_weights = self.model.get_weights()
target_weights = self.target_model.get_weights()
for i in range(len(target_weights)):
target_weights[i] = self.tau * main_weights[i] + (1 - self.tau) * target_weights[i]
self.target_model.set_weights(target_weights)
# checks if current model is the best model based on total reward
def is_best_model(self, total_reward):
return total_reward > self.best_total_reward
# updates the best model with the current model
def update_best_model(self, total_reward):
self.best_total_reward = total_reward
# Main function to run the training
def main(tau):
env_string = 'Pendulum-v0'
num_actions = 40
state_size = gym.make(env_string).observation_space.shape[0]
agent = Agent(env_string, num_actions, state_size, tau = tau)
episodes = 25
rewards = [] # Initialize rewards list to store total rewards per episode
for ep in range(episodes):
state = agent.env.reset()
done = False
total_reward = 0
# Capture frames for GIF
frames = []
while not done:
# Render the environment and capture frames
frames.append(agent.env.render(mode='rgb_array'))
action_index, _ = agent.select_action(state)
action = [agent.action_scope[action_index]]
next_state, reward, done, _ = agent.env.step(action)
agent.store_transition(state, action_index, reward, next_state)
state = next_state
total_reward += reward
agent.learn()
agent.update_target_model()
rewards.append(total_reward) # Append the total reward of the episode to the rewards list
print(f"Episode: {ep+1}, Total Reward: {total_reward:.2f}, Epsilon: {agent.epsilon:.2f}")
if agent.is_best_model(total_reward):
# Update and save the best model weights
agent.update_best_model(total_reward)
model_dir = f'Weights/DQN/Tau{tau}'
save_model_weights(agent, model_dir)
print("Best model weights saved.")
# Directory where you want to save the files
save_dir = f'training_animations/DQN/Tau{tau}'
# Create the directory if it doesn't exist
if not os.path.exists(save_dir):
os.makedirs(save_dir)
# Save frames as GIF
gif_filename = os.path.join(save_dir, f'episode_{ep+1}.gif')
imageio.mimsave(gif_filename, frames, duration=0.00333, loop=0) # Adjust duration as needed
# Close the environment
agent.env.close()
return rewards
tau_values = [0.005, 0.01, 0.05]
rewards_tau = []
for tau in tau_values:
rewards = main(tau = tau)
rewards_tau.append(rewards)
Episode: 1, Total Reward: -1474.09, Epsilon: 0.03
Best model weights saved.
Episode: 2, Total Reward: -1009.00, Epsilon: 0.01
Best model weights saved.
Episode: 3, Total Reward: -1575.66, Epsilon: 0.01
Episode: 4, Total Reward: -1374.86, Epsilon: 0.01
Episode: 5, Total Reward: -1089.26, Epsilon: 0.01
Episode: 6, Total Reward: -1583.16, Epsilon: 0.01
Episode: 7, Total Reward: -1303.44, Epsilon: 0.01
Episode: 8, Total Reward: -937.42, Epsilon: 0.01
Best model weights saved.
Episode: 9, Total Reward: -1075.38, Epsilon: 0.01
Episode: 10, Total Reward: -1231.00, Epsilon: 0.01
Episode: 11, Total Reward: -1199.26, Epsilon: 0.01
Episode: 12, Total Reward: -889.74, Epsilon: 0.01
Best model weights saved.
Episode: 13, Total Reward: -784.58, Epsilon: 0.01
Best model weights saved.
Episode: 14, Total Reward: -1132.56, Epsilon: 0.01
Episode: 15, Total Reward: -500.52, Epsilon: 0.01
Best model weights saved.
Episode: 16, Total Reward: -506.87, Epsilon: 0.01
Episode: 17, Total Reward: -384.41, Epsilon: 0.01
Best model weights saved.
Episode: 18, Total Reward: -510.72, Epsilon: 0.01
Episode: 19, Total Reward: -250.39, Epsilon: 0.01
Best model weights saved.
Episode: 20, Total Reward: -121.58, Epsilon: 0.01
Best model weights saved.
Episode: 21, Total Reward: -694.78, Epsilon: 0.01
Episode: 22, Total Reward: -259.50, Epsilon: 0.01
Episode: 23, Total Reward: -126.60, Epsilon: 0.01
Episode: 24, Total Reward: -121.30, Epsilon: 0.01
Best model weights saved.
Episode: 25, Total Reward: -121.93, Epsilon: 0.01
Episode: 1, Total Reward: -1067.07, Epsilon: 0.03
Best model weights saved.
Episode: 2, Total Reward: -1407.53, Epsilon: 0.01
Episode: 3, Total Reward: -1708.22, Epsilon: 0.01
Episode: 4, Total Reward: -1507.92, Epsilon: 0.01
Episode: 5, Total Reward: -907.53, Epsilon: 0.01
Best model weights saved.
Episode: 6, Total Reward: -921.53, Epsilon: 0.01
Episode: 7, Total Reward: -819.63, Epsilon: 0.01
Best model weights saved.
Episode: 8, Total Reward: -1190.52, Epsilon: 0.01
Episode: 9, Total Reward: -971.26, Epsilon: 0.01
Episode: 10, Total Reward: -259.95, Epsilon: 0.01
Best model weights saved.
Episode: 11, Total Reward: -633.50, Epsilon: 0.01
Episode: 12, Total Reward: -127.48, Epsilon: 0.01
Best model weights saved.
Episode: 13, Total Reward: -126.54, Epsilon: 0.01
Best model weights saved.
Episode: 14, Total Reward: -834.58, Epsilon: 0.01
Episode: 15, Total Reward: -251.63, Epsilon: 0.01
Episode: 16, Total Reward: -352.08, Epsilon: 0.01
Episode: 17, Total Reward: -549.69, Epsilon: 0.01
Episode: 18, Total Reward: -129.36, Epsilon: 0.01
Episode: 19, Total Reward: -128.23, Epsilon: 0.01
Episode: 20, Total Reward: -128.37, Epsilon: 0.01
Episode: 21, Total Reward: -225.57, Epsilon: 0.01
Episode: 22, Total Reward: -127.54, Epsilon: 0.01
Episode: 23, Total Reward: -232.50, Epsilon: 0.01
Episode: 24, Total Reward: -128.00, Epsilon: 0.01
Episode: 25, Total Reward: -571.90, Epsilon: 0.01
Episode: 1, Total Reward: -975.55, Epsilon: 0.03
Best model weights saved.
Episode: 2, Total Reward: -1332.39, Epsilon: 0.01
Episode: 3, Total Reward: -1189.33, Epsilon: 0.01
Episode: 4, Total Reward: -1227.98, Epsilon: 0.01
Episode: 5, Total Reward: -1015.27, Epsilon: 0.01
Episode: 6, Total Reward: -1117.52, Epsilon: 0.01
Episode: 7, Total Reward: -836.50, Epsilon: 0.01
Best model weights saved.
Episode: 8, Total Reward: -1053.56, Epsilon: 0.01
Episode: 9, Total Reward: -523.04, Epsilon: 0.01
Best model weights saved.
Episode: 10, Total Reward: -1027.64, Epsilon: 0.01
Episode: 11, Total Reward: -228.22, Epsilon: 0.01
Best model weights saved.
Episode: 12, Total Reward: -774.46, Epsilon: 0.01
Episode: 13, Total Reward: -499.18, Epsilon: 0.01
Episode: 14, Total Reward: -122.54, Epsilon: 0.01
Best model weights saved.
Episode: 15, Total Reward: -496.88, Epsilon: 0.01
Episode: 16, Total Reward: -472.66, Epsilon: 0.01
Episode: 17, Total Reward: -744.83, Epsilon: 0.01
Episode: 18, Total Reward: -744.94, Epsilon: 0.01
Episode: 19, Total Reward: -362.49, Epsilon: 0.01
Episode: 20, Total Reward: -255.26, Epsilon: 0.01
Episode: 21, Total Reward: -254.17, Epsilon: 0.01
Episode: 22, Total Reward: -117.79, Epsilon: 0.01
Best model weights saved.
Episode: 23, Total Reward: -470.91, Epsilon: 0.01
Episode: 24, Total Reward: -638.35, Epsilon: 0.01
Episode: 25, Total Reward: -1350.34, Epsilon: 0.01
plot_rewards_subplots(rewards_tau, tau_values, 'Tau')
compute_moving_average_and_plot_subplots(rewards_tau, tau_values, 'Tau')
GIF of highest reward attempt¶
Tau = 0.005¶
Tau = 0.01¶
Tau = 0.05¶
Results¶
As with our previous models, all the values of Tau we tried allowed the DQN to balance the pendulum. As shown in the GIFs, the DQN can keep the pendulum steady in the upright position. This indicates that the DQN has successfully learned the optimal policy for balancing the pendulum, demonstrating effective training and model convergence.
In terms of stability, the model with a Tau value of 0.01 is the most stable. This is evident from the rewards and moving average graph, where it exhibits the fewest violent spikes.
The model with a Tau value of 0.005 shows steady improvement, but displays instability in the earlier episodes, as indicated by the spikes.
The model with a Tau value of 0.05 has fewer spikes in the initial episodes, but there is a sharp decrease in rewards towards the end of the training.
Overall, the model with a Tau value of 0.01 is the most stable, with minimal spikes at the beginning of the episodes and consistent high rewards at the end of training.
We will use a Tau value of 0.01 for future models.
Increase the number of Dense Neurons¶
What changed from the previous model¶
- Number of dense neurons increased from 64 to 128
Why Increase the number of dense neurons¶
We increased the number of dense neurons to further stabilize the training process. A network with more neurons can capture complex relationships in the data more effectively, which allows for better approximation of Q-functions. This enhanced approximation helps the agent make more accurate decisions, resulting in more stable learning.
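To get a feel for the added capacity, we can count the trainable parameters of the two configurations (a rough sketch; it assumes the 3-dimensional Pendulum state and the 40 actions used by our model):

```python
def dense_params(sizes):
    # Total weights + biases for a stack of fully connected layers
    return sum(i * o + o for i, o in zip(sizes, sizes[1:]))

state_size, num_actions = 3, 40

# Two hidden layers of 64 units vs two hidden layers of 128 units
small = dense_params([state_size, 64, 64, num_actions])
large = dense_params([state_size, 128, 128, num_actions])
print(small, large)  # 7016 22184
```

Roughly tripling the parameter count gives the Q-network more room to fit the value surface, at the cost of slower training steps.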
class Memory:
# Constructor, the capacity is the maximum size of the memory. Once capacity is reached, the old memories are removed.
def __init__(self, capacity):
self.memory = deque(maxlen=capacity)
'''
Adds transition to the memory. The transition is a tuple of (state, action, reward, next_state).
If the maximum capacity of memory is reached, the oldest memories are overwritten.
'''
def update(self, transition):
self.memory.append(transition)
'''
Retrieve a random sample from memory, batch size indicates the number of samples to be retrieved.
'''
def sample(self, batch_size):
return random.sample(self.memory, batch_size)
class Net(tf.keras.Model):
# constructor initializes the layers
def __init__(self, input_size, num_actions):
super(Net, self).__init__()
self.dense1 = Dense(128, activation='relu', input_shape=(input_size,))
self.dense2 = Dense(128, activation='relu')
self.output_layer = Dense(num_actions, activation='linear')
'''
This function is the forward pass of the model. It takes the state as input and returns the Q values for each action.
'x' is the input to the model, which is the state from the environment. 'x' is then passed through the 2 dense
layers and the output layer to get the predicted values for each action. The hidden layers use the 'relu' activation,
while the output layer uses a linear activation so it can predict an unbounded range of action values in the RL model
'''
def call(self, x):
# first dense layer
x = self.dense1(x)
# second dense layer
x = self.dense2(x)
return self.output_layer(x)
class Agent:
# this constructor initializes the environment, model, memory, and other variables required for the agent
def __init__(self, env_string, num_actions, state_size, batch_size=32, learning_rate=0.01, gamma=0.98, epsilon=1.0, epsilon_decay=0.98, epsilon_min=0.01, tau=0.01, memory_capacity=10000):
self.env_string = env_string
self.env = gym.make(env_string)
self.env.reset()
self.state_size = state_size
self.num_actions = num_actions
self.action_scope = [i * 4.0 / (num_actions - 1) - 2.0 for i in range(num_actions)] # Adjusted to match action scaling in dueling DQN
self.batch_size = batch_size
self.gamma = gamma
self.epsilon = epsilon
self.epsilon_decay = epsilon_decay
self.epsilon_min = epsilon_min
self.tau = tau
self.memory = Memory(memory_capacity)
self.best_total_reward = float('-inf')
self.model = Net(state_size, num_actions)
self.target_model = Net(state_size, num_actions)
self.optimizer = Adam(learning_rate)
self.loss_fn = tf.losses.MeanSquaredError()
'''
Implements the epsilon-greedy policy. With probability epsilon, a random action is selected, otherwise the action chosen will
be the one with the highest Q value. The epsilon value is decayed over time to reduce the exploration as the agent learns.
It also converts the state to a tensor before passing through to the model.
'''
def select_action(self, state):
# if random number is less than epsilon, return random action
if np.random.rand() < self.epsilon:
return np.random.randint(self.num_actions), None
# convert to tensor
state = tf.convert_to_tensor([state], dtype=tf.float32)
q_values = self.model(state)
# else, return action with highest Q value
return np.argmax(q_values.numpy()), None
# Store the experience tuple in the memory buffer via the Memory.update function defined above
def store_transition(self, state, action, reward, next_state):
self.memory.update((state, action, reward, next_state, False))
'''
Learn function performs single step training on a batch of experiences sampled from the memory buffer.
'''
def learn(self):
if len(self.memory.memory) < self.batch_size:
return
# sample a batch of transitions from the memory
transitions = self.memory.sample(self.batch_size)
# extract the states, actions, rewards, next states, from the batch
state_batch, action_batch, reward_batch, next_state_batch, _ = zip(*transitions)
# convert the s, a, r, s_' to tensors
state_batch = tf.convert_to_tensor(state_batch, dtype=tf.float32)
action_batch = tf.convert_to_tensor(action_batch, dtype=tf.int32)
reward_batch = tf.convert_to_tensor(reward_batch, dtype=tf.float32)
next_state_batch = tf.convert_to_tensor(next_state_batch, dtype=tf.float32)
'''
TensorFlow's GradientTape is an API for automatic differentiation: it records the operations executed inside its context so gradients can be computed.
The block below calculates the predicted Q-values from the current states and actions, the target Q-values, and the loss between them.
'''
with tf.GradientTape() as tape:
'''
Forward pass through the DQN model to get the Q-values for all actions given the current batch of states.
q-values contains the predicted q-values for each action in each state of the batch
'''
q_values = self.model(state_batch)
'''
'action_indices' calculates the indices of the actions taken in the Q-value matrix.
Since q_values contains Q-values for all actions, this step is necessary to select only the Q-values corresponding to the actions that were actually taken.
'''
action_indices = tf.range(self.batch_size) * self.num_actions + action_batch
'''
Reshapes the Q-value matrix to a single vector, then selects the Q-values for the actions taken using the 'action_indices' calculated above.
'''
predicted_q = tf.gather(tf.reshape(q_values, [-1]), action_indices)
'''
Forward pass through the target DQN model to get Q-values for all actions given the next states.
'''
next_q_values = self.target_model(next_state_batch)
# Finds the maximum Q-value among all actions for each next state, which represents the best possible future reward achievable from the next state.
max_next_q = tf.reduce_max(next_q_values, axis=1)
'''
Calculates the target Q-values using the immediate reward received (reward_batch) and the
discounted maximum future reward (self.gamma * max_next_q). This forms the update target for the Q-value of the action taken.
'''
target_q = reward_batch + self.gamma * max_next_q
# Calculates the loss between the predicted Q-values and the target Q-values, basically MSE loss
loss = self.loss_fn(target_q, predicted_q)
# Calculate the gradients of the loss with respect to the model parameters
grads = tape.gradient(loss, self.model.trainable_variables)
self.optimizer.apply_gradients(zip(grads, self.model.trainable_variables))
self.update_epsilon()
# update the epsilon value
def update_epsilon(self):
self.epsilon = max(self.epsilon_min, self.epsilon_decay * self.epsilon)
# soft-update the target model towards the main model
def update_target_model(self):
main_weights = self.model.get_weights()
target_weights = self.target_model.get_weights()
for i in range(len(target_weights)):
target_weights[i] = self.tau * main_weights[i] + (1 - self.tau) * target_weights[i]
self.target_model.set_weights(target_weights)
# checks if current model is the best model based on total reward
def is_best_model(self, total_reward):
return total_reward > self.best_total_reward
# updates the best model with the current model
def update_best_model(self, total_reward):
self.best_total_reward = total_reward
# Main function to run the training
def main():
env_string = 'Pendulum-v0'
num_actions = 40
state_size = gym.make(env_string).observation_space.shape[0]
agent = Agent(env_string, num_actions, state_size)
episodes = 25
rewards = [] # Initialize rewards list to store total rewards per episode
for ep in range(episodes):
state = agent.env.reset()
done = False
total_reward = 0
# Capture frames for GIF
frames = []
while not done:
# Render the environment and capture frames
frames.append(agent.env.render(mode='rgb_array'))
action_index, _ = agent.select_action(state)
action = [agent.action_scope[action_index]]
next_state, reward, done, _ = agent.env.step(action)
agent.store_transition(state, action_index, reward, next_state)
state = next_state
total_reward += reward
agent.learn()
agent.update_target_model()
rewards.append(total_reward) # Append the total reward of the episode to the rewards list
print(f"Episode: {ep+1}, Total Reward: {total_reward:.2f}, Epsilon: {agent.epsilon:.2f}")
if agent.is_best_model(total_reward):
# Update and save the best model weights
agent.update_best_model(total_reward)
model_dir = 'Weights/DQN/increaseDense'
save_model_weights(agent, model_dir)
print("Best model weights saved.")
# Directory where you want to save the files
save_dir = 'training_animations/DQN/increaseDense'
# Create the directory if it doesn't exist
if not os.path.exists(save_dir):
os.makedirs(save_dir)
# Save frames as GIF
gif_filename = os.path.join(save_dir, f'episode_{ep+1}.gif')
imageio.mimsave(gif_filename, frames, duration=0.00333, loop=0) # Adjust duration as needed
agent.env.close()
return rewards
rewards = main()
Episode: 1, Total Reward: -859.01, Epsilon: 0.03
Best model weights saved.
Episode: 2, Total Reward: -1340.84, Epsilon: 0.01
Episode: 3, Total Reward: -1253.62, Epsilon: 0.01
Episode: 4, Total Reward: -1447.68, Epsilon: 0.01
Episode: 5, Total Reward: -1122.81, Epsilon: 0.01
Episode: 6, Total Reward: -638.20, Epsilon: 0.01
Best model weights saved.
Episode: 7, Total Reward: -528.61, Epsilon: 0.01
Best model weights saved.
Episode: 8, Total Reward: -651.26, Epsilon: 0.01
Episode: 9, Total Reward: -1230.46, Epsilon: 0.01
Episode: 10, Total Reward: -1354.16, Epsilon: 0.01
Episode: 11, Total Reward: -1205.33, Epsilon: 0.01
Episode: 12, Total Reward: -523.12, Epsilon: 0.01
Best model weights saved.
Episode: 13, Total Reward: -455.71, Epsilon: 0.01
Best model weights saved.
Episode: 14, Total Reward: -748.52, Epsilon: 0.01
Episode: 15, Total Reward: -379.53, Epsilon: 0.01
Best model weights saved.
Episode: 16, Total Reward: -132.11, Epsilon: 0.01
Best model weights saved.
Episode: 17, Total Reward: -6.38, Epsilon: 0.01
Best model weights saved.
Episode: 18, Total Reward: -6.51, Epsilon: 0.01
Episode: 19, Total Reward: -134.06, Epsilon: 0.01
Episode: 20, Total Reward: -3.42, Epsilon: 0.01
Best model weights saved.
Episode: 21, Total Reward: -370.48, Epsilon: 0.01
Episode: 22, Total Reward: -380.46, Epsilon: 0.01
Episode: 23, Total Reward: -373.02, Epsilon: 0.01
Episode: 24, Total Reward: -1286.08, Epsilon: 0.01
Episode: 25, Total Reward: -1420.55, Epsilon: 0.01
plot_rewards(rewards)
compute_moving_average_and_plot(rewards)
GIF of highest reward attempt¶
Results¶
The model, after increasing the number of dense units, is able to balance the pendulum. As seen in the GIF, when the pendulum is in the upright position, the DQN can maintain its balance and keep it steady. This suggests that the DQN has successfully learned the optimal policy for balancing the pendulum, indicating effective training and model convergence.
Additionally, the model shows improvement in terms of stability. Compared to the previous model (Decrease the number of Episodes), the reward and moving average curves exhibit fewer and less violent spikes. This suggests that the current model achieves better stability and smoother training.
Finding the optimal number of actions¶
What is Number of Actions¶
The "number of actions" refers to the size of the action space in a reinforcement learning environment. It represents the total number of distinct actions an agent can choose from at any given time.
Since we discretize the number of actions in the Pendulum-v0 environment, the agent has a finite set of actions to select from, making the decision-making process more manageable and allowing for more structured learning.
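The list comprehension in our `Agent` constructor (`action_scope`) implements exactly this discretisation: `num_actions` evenly spaced torques covering Pendulum's continuous action range [-2, 2]. A standalone sketch (the function name is ours, for illustration):

```python
def discretize_actions(num_actions, low=-2.0, high=2.0):
    # num_actions evenly spaced torque values covering [low, high]
    step = (high - low) / (num_actions - 1)
    return [low + i * step for i in range(num_actions)]

print(discretize_actions(3))  # [-2.0, 0.0, 2.0]
print(discretize_actions(5))  # [-2.0, -1.0, 0.0, 1.0, 2.0]
```

With only 3 actions the agent can apply full torque in either direction or none at all, while 21 actions give it much finer control at the price of a larger action space to explore.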
How does number of actions affect stability of the model¶
With a larger action space, the agent has more choices to explore. This can make it challenging to efficiently explore and learn the optimal policy, as the agent may require more interactions with the environment to understand the consequences of each action. This can lead to instability, especially if the exploration strategy is not well-tuned.
A smaller action space reduces the complexity of exploration, making it easier for the agent to learn effective policies. However, it might also limit the agent's ability to find the optimal policy, particularly in environments where fine-grained control is necessary.
What will we be doing¶
Initial number of actions: 40
We will be testing different numbers of actions to determine which one provides the most stable training
Values of the number of actions that we will be trying: [3, 5, 21]
class Memory:
# Constructor, the capacity is the maximum size of the memory. Once capacity is reached, the old memories are removed.
def __init__(self, capacity):
self.memory = deque(maxlen=capacity)
'''
Adds transition to the memory. The transition is a tuple of (state, action, reward, next_state).
If the maximum capacity of memory is reached, the oldest memories are overwritten.
'''
def update(self, transition):
self.memory.append(transition)
'''
Retrieve a random sample from memory, batch size indicates the number of samples to be retrieved.
'''
def sample(self, batch_size):
return random.sample(self.memory, batch_size)
class Net(tf.keras.Model):
# constructor initializes the layers
def __init__(self, input_size, num_actions):
super(Net, self).__init__()
self.dense1 = Dense(128, activation='relu', input_shape=(input_size,))
self.dense2 = Dense(128, activation='relu')
self.output_layer = Dense(num_actions, activation='linear')
'''
This function is the forward pass of the model. It takes the state as input and returns the Q values for each action.
'x' is the input to the model, which is the state from the environment. 'x' is then passed through the 2 dense
layers and the output layer to get the predicted values for each action. The hidden layers use the 'relu' activation,
while the output layer uses a linear activation so it can predict an unbounded range of action values in the RL model
'''
def call(self, x):
# first dense layer
x = self.dense1(x)
# second dense layer
x = self.dense2(x)
return self.output_layer(x)
class Agent:
# this constructor initializes the environment, model, memory, and other variables required for the agent
def __init__(self, env_string, num_actions, state_size, batch_size=32, learning_rate=0.01, gamma=0.98, epsilon=1.0, epsilon_decay=0.98, epsilon_min=0.01, tau=0.01, memory_capacity=10000):
self.env_string = env_string
self.env = gym.make(env_string)
self.env.reset()
self.state_size = state_size
self.num_actions = num_actions
self.action_scope = [i * 4.0 / (num_actions - 1) - 2.0 for i in range(num_actions)] # Adjusted to match action scaling in dueling DQN
self.batch_size = batch_size
self.gamma = gamma
self.epsilon = epsilon
self.epsilon_decay = epsilon_decay
self.epsilon_min = epsilon_min
self.tau = tau
self.memory = Memory(memory_capacity)
self.best_total_reward = float('-inf')
self.model = Net(state_size, num_actions)
self.target_model = Net(state_size, num_actions)
self.optimizer = Adam(learning_rate)
self.loss_fn = tf.losses.MeanSquaredError()
'''
Implements the epsilon-greedy policy. With probability epsilon, a random action is selected, otherwise the action chosen will
be the one with the highest Q value. The epsilon value is decayed over time to reduce the exploration as the agent learns.
It also converts the state to a tensor before passing through to the model.
'''
def select_action(self, state):
# if random number is less than epsilon, return random action
if np.random.rand() < self.epsilon:
return np.random.randint(self.num_actions), None
# convert to tensor
state = tf.convert_to_tensor([state], dtype=tf.float32)
q_values = self.model(state)
# else, return action with highest Q value
return np.argmax(q_values.numpy()), None
# Store the experience tuple in the memory buffer via the Memory.update function defined above
def store_transition(self, state, action, reward, next_state):
self.memory.update((state, action, reward, next_state, False))
'''
Learn function performs single step training on a batch of experiences sampled from the memory buffer.
'''
def learn(self):
if len(self.memory.memory) < self.batch_size:
return
# sample a batch of transitions from the memory
transitions = self.memory.sample(self.batch_size)
# extract the states, actions, rewards, next states, from the batch
state_batch, action_batch, reward_batch, next_state_batch, _ = zip(*transitions)
# convert the s, a, r, s_' to tensors
state_batch = tf.convert_to_tensor(state_batch, dtype=tf.float32)
action_batch = tf.convert_to_tensor(action_batch, dtype=tf.int32)
reward_batch = tf.convert_to_tensor(reward_batch, dtype=tf.float32)
next_state_batch = tf.convert_to_tensor(next_state_batch, dtype=tf.float32)
'''
TensorFlow's GradientTape is an API for automatic differentiation: it records the operations executed inside its context so gradients can be computed.
The block below calculates the predicted Q-values from the current states and actions, the target Q-values, and the loss between them.
'''
with tf.GradientTape() as tape:
'''
Forward pass through the DQN model to get the Q-values for all actions given the current batch of states.
q-values contains the predicted q-values for each action in each state of the batch
'''
q_values = self.model(state_batch)
'''
'action_indices' calculates the indices of the actions taken in the Q-value matrix.
Since q_values contains Q-values for all actions, this step is necessary to select only the Q-values corresponding to the actions that were actually taken.
'''
action_indices = tf.range(self.batch_size) * self.num_actions + action_batch
'''
Reshapes the Q-value matrix to a single vector, then selects the Q-values for the actions taken using the 'action_indices' calculated above.
'''
predicted_q = tf.gather(tf.reshape(q_values, [-1]), action_indices)
'''
Forward pass through the target DQN model to get Q-values for all actions given the next states.
'''
next_q_values = self.target_model(next_state_batch)
# Finds the maximum Q-value among all actions for each next state, which represents the best possible future reward achievable from the next state.
max_next_q = tf.reduce_max(next_q_values, axis=1)
'''
Calculates the target Q-values using the immediate reward received (reward_batch) and the
discounted maximum future reward (self.gamma * max_next_q). This forms the update target for the Q-value of the action taken.
'''
target_q = reward_batch + self.gamma * max_next_q
# Calculates the loss between the predicted Q-values and the target Q-values, basically MSE loss
loss = self.loss_fn(target_q, predicted_q)
# Calculate the gradients of the loss with respect to the model parameters
grads = tape.gradient(loss, self.model.trainable_variables)
self.optimizer.apply_gradients(zip(grads, self.model.trainable_variables))
self.update_epsilon()
# update the epsilon value
def update_epsilon(self):
self.epsilon = max(self.epsilon_min, self.epsilon_decay * self.epsilon)
# soft-update the target model towards the main model using tau (Polyak averaging)
def update_target_model(self):
main_weights = self.model.get_weights()
target_weights = self.target_model.get_weights()
for i in range(len(target_weights)):
target_weights[i] = self.tau * main_weights[i] + (1 - self.tau) * target_weights[i]
self.target_model.set_weights(target_weights)
# checks if current model is the best model based on total reward
def is_best_model(self, total_reward):
return total_reward > self.best_total_reward
# updates the best model with the current model
def update_best_model(self, total_reward):
self.best_total_reward = total_reward
# Main function to run the training
def main(num_actions):
env_string = 'Pendulum-v0'
state_size = gym.make(env_string).observation_space.shape[0]
agent = Agent(env_string, num_actions, state_size)
episodes = 25
rewards = [] # Initialize rewards list to store total rewards per episode
for ep in range(episodes):
state = agent.env.reset()
done = False
total_reward = 0
# Capture frames for GIF
frames = []
while not done:
# Render the environment and capture frames
frames.append(agent.env.render(mode='rgb_array'))
action_index, _ = agent.select_action(state)
action = [agent.action_scope[action_index]]
next_state, reward, done, _ = agent.env.step(action)
agent.store_transition(state, action_index, reward, next_state)
state = next_state
total_reward += reward
agent.learn()
agent.update_target_model()
rewards.append(total_reward) # Append the total reward of the episode to the rewards list
print(f"Episode: {ep+1}, Total Reward: {total_reward:.2f}, Epsilon: {agent.epsilon:.2f}")
if agent.is_best_model(total_reward):
# Update and save the best model weights
agent.update_best_model(total_reward)
model_dir = f'Weights/DQN/numActions{num_actions}'
save_model_weights(agent, model_dir)
print("Best model weights saved.")
# Directory where you want to save the files
save_dir = f'training_animations/DQN/numActions{num_actions}'
# Create the directory if it doesn't exist
if not os.path.exists(save_dir):
os.makedirs(save_dir)
# Save frames as GIF
gif_filename = os.path.join(save_dir, f'episode_{ep+1}.gif')
imageio.mimsave(gif_filename, frames, duration=0.00333, loop=0) # Adjust duration as needed
agent.env.close()
return rewards
Looping through different numbers of actions to find the optimal one¶
num_actions_test = [3, 5, 21]
rewards_num_actions = []
for num_actions in num_actions_test:
# Run the training with the number of actions
rewards = main(num_actions)
# Store the rewards for the current num_actions
rewards_num_actions.append(rewards)
Episode: 1, Total Reward: -1083.81, Epsilon: 0.03 Best model weights saved. Episode: 2, Total Reward: -1532.71, Epsilon: 0.01 Episode: 3, Total Reward: -1545.18, Epsilon: 0.01 Episode: 4, Total Reward: -974.98, Epsilon: 0.01 Best model weights saved. Episode: 5, Total Reward: -665.28, Epsilon: 0.01 Best model weights saved. Episode: 6, Total Reward: -580.26, Epsilon: 0.01 Best model weights saved. Episode: 7, Total Reward: -265.74, Epsilon: 0.01 Best model weights saved. Episode: 8, Total Reward: -671.50, Epsilon: 0.01 Episode: 9, Total Reward: -492.00, Epsilon: 0.01 Episode: 10, Total Reward: -446.16, Epsilon: 0.01 Episode: 11, Total Reward: -351.09, Epsilon: 0.01 Episode: 12, Total Reward: -3.45, Epsilon: 0.01 Best model weights saved. Episode: 13, Total Reward: -4.73, Epsilon: 0.01 Episode: 14, Total Reward: -443.36, Epsilon: 0.01 Episode: 15, Total Reward: -126.42, Epsilon: 0.01 Episode: 16, Total Reward: -9.17, Epsilon: 0.01 Episode: 17, Total Reward: -131.46, Epsilon: 0.01 Episode: 18, Total Reward: -125.20, Epsilon: 0.01 Episode: 19, Total Reward: -296.94, Epsilon: 0.01 Episode: 20, Total Reward: -129.33, Epsilon: 0.01 Episode: 21, Total Reward: -133.25, Epsilon: 0.01 Episode: 22, Total Reward: -126.24, Epsilon: 0.01 Episode: 23, Total Reward: -127.68, Epsilon: 0.01 Episode: 24, Total Reward: -135.92, Epsilon: 0.01 Episode: 25, Total Reward: -128.24, Epsilon: 0.01 Episode: 1, Total Reward: -1572.41, Epsilon: 0.03 Best model weights saved. Episode: 2, Total Reward: -1522.96, Epsilon: 0.01 Best model weights saved. Episode: 3, Total Reward: -1451.41, Epsilon: 0.01 Best model weights saved. Episode: 4, Total Reward: -1325.68, Epsilon: 0.01 Best model weights saved. Episode: 5, Total Reward: -1140.72, Epsilon: 0.01 Best model weights saved. Episode: 6, Total Reward: -1377.76, Epsilon: 0.01 Episode: 7, Total Reward: -714.46, Epsilon: 0.01 Best model weights saved. Episode: 8, Total Reward: -265.74, Epsilon: 0.01 Best model weights saved. 
Episode: 9, Total Reward: -130.88, Epsilon: 0.01 Best model weights saved. Episode: 10, Total Reward: -258.35, Epsilon: 0.01 Episode: 11, Total Reward: -244.18, Epsilon: 0.01 Episode: 12, Total Reward: -259.13, Epsilon: 0.01 Episode: 13, Total Reward: -126.55, Epsilon: 0.01 Best model weights saved. Episode: 14, Total Reward: -381.13, Epsilon: 0.01 Episode: 15, Total Reward: -250.87, Epsilon: 0.01 Episode: 16, Total Reward: -4.01, Epsilon: 0.01 Best model weights saved. Episode: 17, Total Reward: -2.99, Epsilon: 0.01 Best model weights saved. Episode: 18, Total Reward: -370.53, Epsilon: 0.01 Episode: 19, Total Reward: -255.21, Epsilon: 0.01 Episode: 20, Total Reward: -133.31, Epsilon: 0.01 Episode: 21, Total Reward: -127.16, Epsilon: 0.01 Episode: 22, Total Reward: -132.96, Epsilon: 0.01 Episode: 23, Total Reward: -127.60, Epsilon: 0.01 Episode: 24, Total Reward: -254.34, Epsilon: 0.01 Episode: 25, Total Reward: -260.49, Epsilon: 0.01 Episode: 1, Total Reward: -960.86, Epsilon: 0.03 Best model weights saved. Episode: 2, Total Reward: -1540.72, Epsilon: 0.01 Episode: 3, Total Reward: -1574.19, Epsilon: 0.01 Episode: 4, Total Reward: -1450.09, Epsilon: 0.01 Episode: 5, Total Reward: -1040.59, Epsilon: 0.01 Episode: 6, Total Reward: -1263.53, Epsilon: 0.01 Episode: 7, Total Reward: -1018.05, Epsilon: 0.01 Episode: 8, Total Reward: -1049.92, Epsilon: 0.01 Episode: 9, Total Reward: -952.53, Epsilon: 0.01 Best model weights saved. Episode: 10, Total Reward: -502.23, Epsilon: 0.01 Best model weights saved. Episode: 11, Total Reward: -380.12, Epsilon: 0.01 Best model weights saved. Episode: 12, Total Reward: -242.24, Epsilon: 0.01 Best model weights saved. Episode: 13, Total Reward: -243.63, Epsilon: 0.01 Episode: 14, Total Reward: -363.33, Epsilon: 0.01 Episode: 15, Total Reward: -126.09, Epsilon: 0.01 Best model weights saved. 
Episode: 16, Total Reward: -240.12, Epsilon: 0.01 Episode: 17, Total Reward: -254.47, Epsilon: 0.01 Episode: 18, Total Reward: -255.69, Epsilon: 0.01 Episode: 19, Total Reward: -123.19, Epsilon: 0.01 Best model weights saved. Episode: 20, Total Reward: -1.12, Epsilon: 0.01 Best model weights saved. Episode: 21, Total Reward: -250.24, Epsilon: 0.01 Episode: 22, Total Reward: -124.13, Epsilon: 0.01 Episode: 23, Total Reward: -0.87, Epsilon: 0.01 Best model weights saved. Episode: 24, Total Reward: -238.99, Epsilon: 0.01 Episode: 25, Total Reward: -126.09, Epsilon: 0.01
plot_rewards_subplots(rewards_num_actions, num_actions_test, 'Number of Actions')
compute_moving_average_and_plot_subplots(rewards_num_actions, num_actions_test, 'Number of Actions')
GIF of highest reward attempt¶
Number of Actions = 3¶
Number of Actions = 5¶
Number of Actions = 21¶
Results¶
Similar to previous models, all the different action space sizes we tried allowed the DQN to balance the pendulum. The GIF shows that the DQN can maintain the pendulum in an upright and steady position. This demonstrates that the DQN has successfully learned the optimal policy for balancing the pendulum, indicating effective training and model convergence.
In terms of stability, the model with the number of actions set to 5 displayed the best performance, as seen from the minimal spikes in the reward and moving average curves.
The model with the number of actions set to 3 experienced some instability, with noticeable spikes in the middle of the training episodes. Between the first and second episodes, the rewards initially decreased and then began to increase.
The model with the number of actions set to 21 showed a similar trend between the first and second episodes, where the rewards first decreased and then started to rise. However, it experienced more instability in the middle of the training, as indicated by a higher number of spikes compared to the model with 5 actions.
Overall, the model with the number of actions set to 5 demonstrated the most stable training process. Improvement began right from the first episode, with a steady increase in the curve and no initial decrease. During the middle of training, this model also exhibited the best stability, with fewer spikes compared to the other two models.
We will use 5 actions for future models.
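As a side note, the discrete action set itself comes from evenly spacing torques across Pendulum's [-2, 2] torque range (the `action_scope` list computed in the Agent constructor). A minimal sketch of that mapping, with a helper name of our own:

```python
def discretize_actions(num_actions, low=-2.0, high=2.0):
    """Map num_actions indices to evenly spaced torques in [low, high]."""
    step = (high - low) / (num_actions - 1)
    return [low + i * step for i in range(num_actions)]

print(discretize_actions(3))   # [-2.0, 0.0, 2.0]
print(discretize_actions(5))   # [-2.0, -1.0, 0.0, 1.0, 2.0]
```

A larger `num_actions` gives finer torque control but a bigger Q-value output layer to learn, which is the trade-off the sweep above explores.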
Finding out the optimal value for Gamma ¶
What is Gamma¶
In Deep Q-Networks (DQN), gamma (γ) is known as the discount factor. It is a crucial hyperparameter that determines the importance of future rewards versus immediate rewards.
What are future rewards¶
In Reinforcement Learning (RL), future rewards refer to the rewards that an agent expects to receive from the environment after taking an action, beyond the immediate reward received after the action.
Taking the pendulum problem for example, the future rewards are the rewards received in subsequent time steps after the current action. In the Pendulum environment, this could involve maintaining the pendulum’s upright position for future time steps. The agent must consider how its current action will impact future states and rewards. For instance, applying a force to stabilize the pendulum might improve future rewards by keeping the pendulum in the upright position longer.
How gamma affects training stability¶
When gamma is close to 1, future rewards are valued almost as much as immediate rewards. This encourages the agent to consider long-term outcomes and plan actions that may not provide immediate benefits but lead to higher cumulative rewards in the future.
High gamma can lead to more stable training by fostering long-term planning and reducing the likelihood of overfitting to short-term rewards.
When gamma is close to 0, the agent focuses primarily on immediate rewards and discounts future rewards significantly. This simplifies the learning problem by reducing the complexity of future reward calculations.
A low gamma can lead to faster learning and simpler models, but may result in less stable training, as the agent might not effectively account for the long-term consequences of its actions.
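To make the effect of gamma concrete, here is a small illustration of our own (not part of the agent code): with a constant per-step penalty, the discounted return is bounded by roughly 1/(1-gamma), so gamma = 0.9 effectively "sees" about 10 steps ahead while gamma = 0.98 sees about 50.

```python
def discounted_return(rewards, gamma):
    # G = sum over t of gamma^t * r_t
    return sum(gamma ** t * r for t, r in enumerate(rewards))

# A Pendulum-like episode: a constant penalty of -1 for 200 steps.
rewards = [-1.0] * 200
for gamma in (0.9, 0.94, 0.98):
    print(f"gamma={gamma}: discounted return = {discounted_return(rewards, gamma):.2f}")
```

The higher gamma weights the tail of the episode far more heavily, which is why it pushes the agent toward actions that keep the pendulum upright over the long run rather than chasing immediate reward.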
What will we be doing¶
Since our goal is to stabilize the training of the DQN, we will test three different gamma values close to 1 (not less than 0.9) to assess how they impact the stability and effectiveness of the learning process. By evaluating these high gamma values, we aim to determine which one provides the best balance between considering future rewards and ensuring stable convergence of the learning algorithm.
Initial value of Gamma: 0.98
Values of gamma that we will be trying: [0.9, 0.94, 0.98]
References
https://iwaponline.com/ws/article/23/8/2986/96679/Study-on-Gamma-selection-in-the-optimal-operation
class Memory:
# Constructor, the capacity is the maximum size of the memory. Once capacity is reached, the old memories are removed.
def __init__(self, capacity):
self.memory = deque(maxlen=capacity)
'''
Adds transition to the memory. The transition is a tuple of (state, action, reward, next_state, done).
If the maximum capacity of the memory is reached, the oldest memories are overwritten.
'''
def update(self, transition):
self.memory.append(transition)
'''
Retrieve a random sample from memory, batch size indicates the number of samples to be retrieved.
'''
def sample(self, batch_size):
return random.sample(self.memory, batch_size)
class Net(tf.keras.Model):
# constructor initializes the layers
def __init__(self, input_size, num_actions):
super(Net, self).__init__()
self.dense1 = Dense(128, activation='relu', input_shape=(input_size,))
self.dense2 = Dense(128, activation='relu')
self.output_layer = Dense(num_actions, activation='linear')
'''
This function is the forward pass of the model. It takes the state as input and returns the Q values for each action.
'x' is the input to the model, which is the state from the environment. 'x' is then passed through the 2 dense
layers ('relu' activations) and the output layer to get the predicted Q-values for each action. The output layer uses a linear activation so the
network can predict an unbounded range of values, as action-value estimates are not confined to a fixed interval.
'''
def call(self, x):
# first dense layer
x = self.dense1(x)
# second dense layer
x = self.dense2(x)
return self.output_layer(x)
class Agent:
# this constructor initializes the environment, model, memory, and other variables required for the agent
def __init__(self, env_string, num_actions, state_size, batch_size=32, learning_rate=0.01, gamma=0.98, epsilon=1.0, epsilon_decay=0.98, epsilon_min=0.01, tau=0.01, memory_capacity=10000):
self.env_string = env_string
self.env = gym.make(env_string)
self.env.reset()
self.state_size = state_size
self.num_actions = num_actions
self.action_scope = [i * 4.0 / (num_actions - 1) - 2.0 for i in range(num_actions)] # Adjusted to match action scaling in dueling DQN
self.batch_size = batch_size
self.gamma = gamma
self.epsilon = epsilon
self.epsilon_decay = epsilon_decay
self.epsilon_min = epsilon_min
self.tau = tau
self.memory = Memory(memory_capacity)
self.best_total_reward = float('-inf')
self.model = Net(state_size, num_actions)
self.target_model = Net(state_size, num_actions)
self.optimizer = Adam(learning_rate)
self.loss_fn = tf.losses.MeanSquaredError()
'''
Implements the epsilon-greedy policy. With probability epsilon, a random action is selected, otherwise the action chosen will
be the one with the highest Q value. The epsilon value is decayed over time to reduce the exploration as the agent learns.
It also converts the state to a tensor before passing through to the model.
'''
def select_action(self, state):
# if random number is less than epsilon, return random action
if np.random.rand() < self.epsilon:
return np.random.randint(self.num_actions), None
# convert to tensor
state = tf.convert_to_tensor([state], dtype=tf.float32)
q_values = self.model(state)
# else, return action with highest Q value
return np.argmax(q_values.numpy()), None
# Store the experience tuple in the memory buffer (the Memory class is defined above)
def store_transition(self, state, action, reward, next_state):
self.memory.update((state, action, reward, next_state, False))
'''
Learn function performs single step training on a batch of experiences sampled from the memory buffer.
'''
def learn(self):
if len(self.memory.memory) < self.batch_size:
return
# sample a batch of transitions from the memory
transitions = self.memory.sample(self.batch_size)
# extract the states, actions, rewards, next states, from the batch
state_batch, action_batch, reward_batch, next_state_batch, _ = zip(*transitions)
# convert the s, a, r, s_' to tensors
state_batch = tf.convert_to_tensor(state_batch, dtype=tf.float32)
action_batch = tf.convert_to_tensor(action_batch, dtype=tf.int32)
reward_batch = tf.convert_to_tensor(reward_batch, dtype=tf.float32)
next_state_batch = tf.convert_to_tensor(next_state_batch, dtype=tf.float32)
'''
TensorFlow's GradientTape is an API for automatic differentiation: it records the operations executed inside the block so that gradients can be computed afterwards.
This function will calculate the predicted Q-values from the current state and actions.
'''
with tf.GradientTape() as tape:
'''
Forward pass through the DQN model to get the Q-values for all actions given the current batch of states.
q-values contains the predicted q-values for each action in each state of the batch
'''
q_values = self.model(state_batch)
'''
'action_indices' calculates the indices of the actions taken in the Q-value matrix.
Since q_values contains Q-values for all actions, this step is necessary to select only the Q-values corresponding to the actions that were actually taken.
'''
action_indices = tf.range(self.batch_size) * self.num_actions + action_batch
'''
Reshapes the Q-value matrix to a single vector, then selects the Q-values for the actions taken using the 'action_indices' calculated above.
'''
predicted_q = tf.gather(tf.reshape(q_values, [-1]), action_indices)
'''
Forward pass through the target DQN model to get Q-values for all actions given the next states.
'''
next_q_values = self.target_model(next_state_batch)
# Finds the maximum Q-value among all actions for each next state, which represents the best possible future reward achievable from the next state.
max_next_q = tf.reduce_max(next_q_values, axis=1)
'''
Calculates the target Q-values using the immediate reward received (reward_batch) and the
discounted maximum future reward (self.gamma * max_next_q). This forms the update target for the Q-value of the action taken.
'''
target_q = reward_batch + self.gamma * max_next_q
# Calculates the mean squared error (MSE) loss between the target and predicted Q-values
loss = self.loss_fn(target_q, predicted_q)
# Calculate the gradients of the loss with respect to the model parameters
grads = tape.gradient(loss, self.model.trainable_variables)
self.optimizer.apply_gradients(zip(grads, self.model.trainable_variables))
self.update_epsilon()
# update the epsilon value
def update_epsilon(self):
self.epsilon = max(self.epsilon_min, self.epsilon_decay * self.epsilon)
# soft-update the target model towards the main model using tau (Polyak averaging)
def update_target_model(self):
main_weights = self.model.get_weights()
target_weights = self.target_model.get_weights()
for i in range(len(target_weights)):
target_weights[i] = self.tau * main_weights[i] + (1 - self.tau) * target_weights[i]
self.target_model.set_weights(target_weights)
# checks if current model is the best model based on total reward
def is_best_model(self, total_reward):
return total_reward > self.best_total_reward
# updates the best model with the current model
def update_best_model(self, total_reward):
self.best_total_reward = total_reward
# Main function to run the training
def main(gamma):
env_string = 'Pendulum-v0'
num_actions = 5
state_size = gym.make(env_string).observation_space.shape[0]
agent = Agent(env_string, num_actions, state_size, gamma = gamma)
episodes = 25
rewards = [] # Initialize rewards list to store total rewards per episode
for ep in range(episodes):
state = agent.env.reset()
done = False
total_reward = 0
# Capture frames for GIF
frames = []
while not done:
# Render the environment and capture frames
frames.append(agent.env.render(mode='rgb_array'))
action_index, _ = agent.select_action(state)
action = [agent.action_scope[action_index]]
next_state, reward, done, _ = agent.env.step(action)
agent.store_transition(state, action_index, reward, next_state)
state = next_state
total_reward += reward
agent.learn()
agent.update_target_model()
rewards.append(total_reward) # Append the total reward of the episode to the rewards list
print(f"Episode: {ep+1}, Total Reward: {total_reward:.2f}, Epsilon: {agent.epsilon:.2f}")
if agent.is_best_model(total_reward):
# Update and save the best model weights
agent.update_best_model(total_reward)
model_dir = f'Weights/DQN/Gamma{gamma}'
save_model_weights(agent, model_dir)
print("Best model weights saved.")
# Directory where you want to save the files
save_dir = f'training_animations/DQN/Gamma{gamma}'
# Create the directory if it doesn't exist
if not os.path.exists(save_dir):
os.makedirs(save_dir)
# Save frames as GIF
gif_filename = os.path.join(save_dir, f'episode_{ep+1}.gif')
imageio.mimsave(gif_filename, frames, duration=0.00333, loop=0) # Adjust duration as needed
agent.env.close()
return rewards
gamma_values = [0.9, 0.94, 0.98]
rewards_gamma = []
for gamma in gamma_values:
# Run the training with the current gamma value
rewards = main(gamma = gamma)
# Store the rewards for the current gamma
rewards_gamma.append(rewards)
Episode: 1, Total Reward: -1211.91, Epsilon: 0.03 Best model weights saved. Episode: 2, Total Reward: -1737.79, Epsilon: 0.01 Episode: 3, Total Reward: -863.44, Epsilon: 0.01 Best model weights saved. Episode: 4, Total Reward: -1336.53, Epsilon: 0.01 Episode: 5, Total Reward: -798.29, Epsilon: 0.01 Best model weights saved. Episode: 6, Total Reward: -877.42, Epsilon: 0.01 Episode: 7, Total Reward: -134.96, Epsilon: 0.01 Best model weights saved. Episode: 8, Total Reward: -136.43, Epsilon: 0.01 Episode: 9, Total Reward: -664.87, Epsilon: 0.01 Episode: 10, Total Reward: -131.06, Epsilon: 0.01 Best model weights saved. Episode: 11, Total Reward: -264.82, Epsilon: 0.01 Episode: 12, Total Reward: -383.35, Epsilon: 0.01 Episode: 13, Total Reward: -128.81, Epsilon: 0.01 Best model weights saved. Episode: 14, Total Reward: -415.02, Epsilon: 0.01 Episode: 15, Total Reward: -348.23, Epsilon: 0.01 Episode: 16, Total Reward: -128.25, Epsilon: 0.01 Best model weights saved. Episode: 17, Total Reward: -271.87, Epsilon: 0.01 Episode: 18, Total Reward: -241.95, Epsilon: 0.01 Episode: 19, Total Reward: -248.85, Epsilon: 0.01 Episode: 20, Total Reward: -131.62, Epsilon: 0.01 Episode: 21, Total Reward: -142.56, Epsilon: 0.01 Episode: 22, Total Reward: -309.05, Epsilon: 0.01 Episode: 23, Total Reward: -264.61, Epsilon: 0.01 Episode: 24, Total Reward: -125.29, Epsilon: 0.01 Best model weights saved. Episode: 25, Total Reward: -5.83, Epsilon: 0.01 Best model weights saved. Episode: 1, Total Reward: -949.70, Epsilon: 0.03 Best model weights saved. Episode: 2, Total Reward: -1396.70, Epsilon: 0.01 Episode: 3, Total Reward: -1513.13, Epsilon: 0.01 Episode: 4, Total Reward: -1035.80, Epsilon: 0.01 Episode: 5, Total Reward: -1316.30, Epsilon: 0.01 Episode: 6, Total Reward: -812.77, Epsilon: 0.01 Best model weights saved. Episode: 7, Total Reward: -948.12, Epsilon: 0.01 Episode: 8, Total Reward: -678.90, Epsilon: 0.01 Best model weights saved. 
Episode: 9, Total Reward: -388.04, Epsilon: 0.01 Best model weights saved. Episode: 10, Total Reward: -135.84, Epsilon: 0.01 Best model weights saved. Episode: 11, Total Reward: -136.39, Epsilon: 0.01 Episode: 12, Total Reward: -260.81, Epsilon: 0.01 Episode: 13, Total Reward: -135.64, Epsilon: 0.01 Best model weights saved. Episode: 14, Total Reward: -125.66, Epsilon: 0.01 Best model weights saved. Episode: 15, Total Reward: -259.75, Epsilon: 0.01 Episode: 16, Total Reward: -132.16, Epsilon: 0.01 Episode: 17, Total Reward: -131.20, Epsilon: 0.01 Episode: 18, Total Reward: -493.40, Epsilon: 0.01 Episode: 19, Total Reward: -254.51, Epsilon: 0.01 Episode: 20, Total Reward: -125.75, Epsilon: 0.01 Episode: 21, Total Reward: -246.58, Epsilon: 0.01 Episode: 22, Total Reward: -131.55, Epsilon: 0.01 Episode: 23, Total Reward: -250.07, Epsilon: 0.01 Episode: 24, Total Reward: -341.17, Epsilon: 0.01 Episode: 25, Total Reward: -129.65, Epsilon: 0.01 Episode: 1, Total Reward: -875.18, Epsilon: 0.03 Best model weights saved. Episode: 2, Total Reward: -1618.42, Epsilon: 0.01 Episode: 3, Total Reward: -1363.49, Epsilon: 0.01 Episode: 4, Total Reward: -1169.59, Epsilon: 0.01 Episode: 5, Total Reward: -934.00, Epsilon: 0.01 Episode: 6, Total Reward: -642.76, Epsilon: 0.01 Best model weights saved. Episode: 7, Total Reward: -653.29, Epsilon: 0.01 Episode: 8, Total Reward: -262.05, Epsilon: 0.01 Best model weights saved. Episode: 9, Total Reward: -494.91, Epsilon: 0.01 Episode: 10, Total Reward: -346.41, Epsilon: 0.01 Episode: 11, Total Reward: -125.13, Epsilon: 0.01 Best model weights saved. 
Episode: 12, Total Reward: -130.47, Epsilon: 0.01 Episode: 13, Total Reward: -453.88, Epsilon: 0.01 Episode: 14, Total Reward: -249.68, Epsilon: 0.01 Episode: 15, Total Reward: -266.31, Epsilon: 0.01 Episode: 16, Total Reward: -238.99, Epsilon: 0.01 Episode: 17, Total Reward: -330.81, Epsilon: 0.01 Episode: 18, Total Reward: -127.00, Epsilon: 0.01 Episode: 19, Total Reward: -237.57, Epsilon: 0.01 Episode: 20, Total Reward: -118.95, Epsilon: 0.01 Best model weights saved. Episode: 21, Total Reward: -2.61, Epsilon: 0.01 Best model weights saved. Episode: 22, Total Reward: -459.97, Epsilon: 0.01 Episode: 23, Total Reward: -240.06, Epsilon: 0.01 Episode: 24, Total Reward: -244.36, Epsilon: 0.01 Episode: 25, Total Reward: -122.44, Epsilon: 0.01
plot_rewards_subplots(rewards_gamma, gamma_values, 'Gamma')
compute_moving_average_and_plot_subplots(rewards_gamma, gamma_values, 'Gamma')
GIF of highest reward attempt¶
Gamma = 0.9¶
Gamma = 0.94¶
Gamma = 0.98¶
Results¶
Similar to previous models, all the different gamma values we tried allowed the DQN to balance the pendulum. The GIF shows that the DQN can maintain the pendulum in an upright and steady position. This demonstrates that the DQN has successfully learned the optimal policy for balancing the pendulum, indicating effective training and model convergence.
In terms of stability, the model with a gamma value of 0.98 demonstrated the best performance, evidenced by minimal spikes in the reward and moving average curves.
The model with a gamma value of 0.9, despite showing signs of stability towards the end of training with smoothed-out rewards, experienced instability between episodes 3 and 12, as indicated by spikes in the rewards graph.
The model with a gamma value of 0.94 exhibited instability at the start of training, with noticeable spikes. This instability persisted throughout the training, with frequent fluctuations.
Overall, the model with a gamma value of 0.98 proved to be the most stable, showing a consistent and smooth curve from the start of training and maintaining minimal spikes throughout.
We will use a gamma value of 0.98 for future models.
Finding out the optimal value for learning rate ¶
What is the learning rate in DQN¶
The learning rate is a hyperparameter used in the optimizer of Deep Q-Networks (DQN). It dictates the size of the steps taken during the optimization process when adjusting the weights of the network. In the context of DQN, the optimizer is typically used to minimize the loss between the predicted Q-values and the target Q-values.
How learning rate affects training stability¶
A high learning rate can lead to faster convergence, as the model makes larger updates to the weights during training. If the learning rate is too high, it can cause the model to overshoot the optimal weights, leading to divergent behavior where the model oscillates or fails to converge. This instability can manifest as large fluctuations in the loss function or Q-values, preventing the model from settling into a stable and accurate solution.
A lower learning rate usually results in more stable and gradual updates to the model's weights, reducing the risk of overshooting and allowing the model to converge smoothly. However, if the learning rate is too low, the training process can become very slow, requiring many iterations to make meaningful progress. The model might also get stuck in local minima, where it doesn't improve significantly because the updates are too small.
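The overshoot versus slow-convergence trade-off can be demonstrated on a toy quadratic, independent of DQN (our own illustration; the DQN loss surface is, of course, far more complex than this):

```python
def gradient_descent(lr, steps=100, x0=5.0):
    """Minimize f(x) = x**2 with a fixed learning rate; the gradient is 2x."""
    x = x0
    for _ in range(steps):
        x -= lr * 2 * x
    return x

print(gradient_descent(0.01))  # too small: still far from the minimum after 100 steps
print(gradient_descent(0.4))   # well-chosen: converges to ~0
print(gradient_descent(1.1))   # too large: each step overshoots and |x| grows
```

With lr = 1.1 every update flips the sign of x and increases its magnitude, which is the same kind of oscillating divergence a too-large learning rate can cause in the Q-network's weights.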
What will we be doing¶
Since our goal is to stabilize the training of the DQN, we will test three different learning rates to assess how they impact the stability and effectiveness of the learning process. By evaluating these learning rates, we aim to determine which one provides the best balance between rapid learning and stable convergence. Specifically, we want to identify a learning rate that allows the DQN to learn efficiently without causing instability.
Initial value of learning rate: 0.01
Values of the learning rate that we will be trying: [0.001, 0.01, 0.05]
class Memory:
# Constructor, the capacity is the maximum size of the memory. Once capacity is reached, the old memories are removed.
def __init__(self, capacity):
self.memory = deque(maxlen=capacity)
'''
Adds transition to the memory. The transition is a tuple of (state, action, reward, next_state, done).
If the maximum capacity of the memory is reached, the oldest memories are overwritten.
'''
def update(self, transition):
self.memory.append(transition)
'''
Retrieve a random sample from memory, batch size indicates the number of samples to be retrieved.
'''
def sample(self, batch_size):
return random.sample(self.memory, batch_size)
class Net(tf.keras.Model):
# constructor initializes the layers
def __init__(self, input_size, num_actions):
super(Net, self).__init__()
self.dense1 = Dense(128, activation='relu', input_shape=(input_size,))
self.dense2 = Dense(128, activation='relu')
self.output_layer = Dense(num_actions, activation='linear')
'''
This function is the forward pass of the model. It takes the state as input and returns the Q values for each action.
'x' is the input to the model, which is the state from the environment. 'x' is then passed through the 2 dense
layers ('relu' activations) and the output layer to get the predicted Q-values for each action. The output layer uses a linear activation so the
network can predict an unbounded range of values, as action-value estimates are not confined to a fixed interval.
'''
def call(self, x):
# first dense layer
x = self.dense1(x)
# second dense layer
x = self.dense2(x)
return self.output_layer(x)
class Agent:
# this constructor initializes the environment, model, memory, and other variables required for the agent
def __init__(self, env_string, num_actions, state_size, batch_size=32, learning_rate=0.01, gamma=0.98, epsilon=1.0, epsilon_decay=0.98, epsilon_min=0.01, tau=0.01, memory_capacity=10000):
self.env_string = env_string
self.env = gym.make(env_string)
self.env.reset()
self.state_size = state_size
self.num_actions = num_actions
self.action_scope = [i * 4.0 / (num_actions - 1) - 2.0 for i in range(num_actions)] # Adjusted to match action scaling in dueling DQN
self.batch_size = batch_size
self.gamma = gamma
self.epsilon = epsilon
self.epsilon_decay = epsilon_decay
self.epsilon_min = epsilon_min
self.tau = tau
self.memory = Memory(memory_capacity)
self.best_total_reward = float('-inf')
self.model = Net(state_size, num_actions)
self.target_model = Net(state_size, num_actions)
self.optimizer = Adam(learning_rate)
self.loss_fn = tf.losses.MeanSquaredError()
'''
Implements the epsilon-greedy policy. With probability epsilon, a random action is selected, otherwise the action chosen will
be the one with the highest Q value. The epsilon value is decayed over time to reduce the exploration as the agent learns.
It also converts the state to a tensor before passing through to the model.
'''
def select_action(self, state):
# if random number is less than epsilon, return random action
if np.random.rand() < self.epsilon:
return np.random.randint(self.num_actions), None
# convert to tensor
state = tf.convert_to_tensor([state], dtype=tf.float32)
q_values = self.model(state)
# else, return action with highest Q value
return np.argmax(q_values.numpy()), None
# Store the experience tuple in the memory buffer (the Memory class is defined above)
def store_transition(self, state, action, reward, next_state):
self.memory.update((state, action, reward, next_state, False))
'''
Learn function performs single step training on a batch of experiences sampled from the memory buffer.
'''
def learn(self):
if len(self.memory.memory) < self.batch_size:
return
# sample a batch of transitions from the memory
transitions = self.memory.sample(self.batch_size)
# extract the states, actions, rewards, next states, from the batch
state_batch, action_batch, reward_batch, next_state_batch, _ = zip(*transitions)
# convert the s, a, r, s_' to tensors
state_batch = tf.convert_to_tensor(state_batch, dtype=tf.float32)
action_batch = tf.convert_to_tensor(action_batch, dtype=tf.int32)
reward_batch = tf.convert_to_tensor(reward_batch, dtype=tf.float32)
next_state_batch = tf.convert_to_tensor(next_state_batch, dtype=tf.float32)
'''
TensorFlow's GradientTape is an API for automatic differentiation: it records the operations executed inside the block so that gradients can be computed afterwards.
This function will calculate the predicted Q-values from the current state and actions.
'''
with tf.GradientTape() as tape:
'''
Forward pass through the DQN model to get the Q-values for all actions given the current batch of states.
q-values contains the predicted q-values for each action in each state of the batch
'''
q_values = self.model(state_batch)
'''
'action_indices' calculates the indices of the actions taken in the Q-value matrix.
Since q_values contains Q-values for all actions, this step is necessary to select only the Q-values corresponding to the actions that were actually taken.
'''
action_indices = tf.range(self.batch_size) * self.num_actions + action_batch
'''
Reshapes the Q-value matrix to a single vector, then selects the Q-values for the actions taken using the 'action_indices' calculated above.
'''
predicted_q = tf.gather(tf.reshape(q_values, [-1]), action_indices)
'''
Forward pass through the target DQN model to get Q-values for all actions given the next states.
'''
next_q_values = self.target_model(next_state_batch)
# Finds the maximum Q-value among all actions for each next state, which represents the best possible future reward achievable from the next state.
max_next_q = tf.reduce_max(next_q_values, axis=1)
'''
Calculates the target Q-values using the immediate reward received (reward_batch) and the
discounted maximum future reward (self.gamma * max_next_q). This forms the update target for the Q-value of the action taken.
'''
target_q = reward_batch + self.gamma * max_next_q
# Calculates the loss between the predicted Q-values and the target Q-values, basically MSE loss
loss = self.loss_fn(target_q, predicted_q)
# Calculate the gradients of the loss with respect to the model parameters
grads = tape.gradient(loss, self.model.trainable_variables)
self.optimizer.apply_gradients(zip(grads, self.model.trainable_variables))
self.update_epsilon()
# update the epsilon value
def update_epsilon(self):
self.epsilon = max(self.epsilon_min, self.epsilon_decay * self.epsilon)
# soft-update the target network towards the main network
def update_target_model(self):
main_weights = self.model.get_weights()
target_weights = self.target_model.get_weights()
for i in range(len(target_weights)):
target_weights[i] = self.tau * main_weights[i] + (1 - self.tau) * target_weights[i]
self.target_model.set_weights(target_weights)
# checks if current model is the best model based on total reward
def is_best_model(self, total_reward):
return total_reward > self.best_total_reward
# updates the best model with the current model
def update_best_model(self, total_reward):
self.best_total_reward = total_reward
# Main function to run the training
def main(learning_rate):
env_string = 'Pendulum-v0'
num_actions = 5
state_size = gym.make(env_string).observation_space.shape[0]
agent = Agent(env_string, num_actions, state_size, learning_rate = learning_rate)
episodes = 25
rewards = [] # Initialize rewards list to store total rewards per episode
for ep in range(episodes):
state = agent.env.reset()
done = False
total_reward = 0
# Capture frames for GIF
frames = []
while not done:
# Render the environment and capture frames
frames.append(agent.env.render(mode='rgb_array'))
action_index, _ = agent.select_action(state)
action = [agent.action_scope[action_index]]
next_state, reward, done, _ = agent.env.step(action)
agent.store_transition(state, action_index, reward, next_state)
state = next_state
total_reward += reward
agent.learn()
agent.update_target_model()
rewards.append(total_reward) # Append the total reward of the episode to the rewards list
print(f"Episode: {ep+1}, Total Reward: {total_reward:.2f}, Epsilon: {agent.epsilon:.2f}")
if agent.is_best_model(total_reward):
# Update and save the best model weights
agent.update_best_model(total_reward)
model_dir = f'Weights/DQN/learningRate{learning_rate}'
save_model_weights(agent, model_dir)
print("Best model weights saved.")
# Directory where you want to save the files
save_dir = f'training_animations/DQN/learningRate{learning_rate}'
# Create the directory if it doesn't exist
if not os.path.exists(save_dir):
os.makedirs(save_dir)
# Save frames as GIF
gif_filename = os.path.join(save_dir, f'episode_{ep+1}.gif')
imageio.mimsave(gif_filename, frames, duration=0.00333, loop=0) # Adjust duration as needed
agent.env.close()
return rewards
lr_values = [0.001, 0.01, 0.05]
rewards_lr = []
for lr in lr_values:
# Run the training with the current learning rate
rewards = main(learning_rate = lr)
# Store the rewards for the current learning rate
rewards_lr.append(rewards)
Episode: 1, Total Reward: -1372.15, Epsilon: 0.03 Best model weights saved. Episode: 2, Total Reward: -1491.44, Epsilon: 0.01 Episode: 3, Total Reward: -1604.92, Epsilon: 0.01 Episode: 4, Total Reward: -1532.10, Epsilon: 0.01 Episode: 5, Total Reward: -1348.17, Epsilon: 0.01 Best model weights saved. Episode: 6, Total Reward: -1500.99, Epsilon: 0.01 Episode: 7, Total Reward: -1414.02, Epsilon: 0.01 Episode: 8, Total Reward: -1185.93, Epsilon: 0.01 Best model weights saved. Episode: 9, Total Reward: -1165.17, Epsilon: 0.01 Best model weights saved. Episode: 10, Total Reward: -1478.66, Epsilon: 0.01 Episode: 11, Total Reward: -1379.69, Epsilon: 0.01 Episode: 12, Total Reward: -1213.89, Epsilon: 0.01 Episode: 13, Total Reward: -1147.13, Epsilon: 0.01 Best model weights saved. Episode: 14, Total Reward: -1044.44, Epsilon: 0.01 Best model weights saved. Episode: 15, Total Reward: -964.76, Epsilon: 0.01 Best model weights saved. Episode: 16, Total Reward: -921.62, Epsilon: 0.01 Best model weights saved. Episode: 17, Total Reward: -768.61, Epsilon: 0.01 Best model weights saved. Episode: 18, Total Reward: -753.09, Epsilon: 0.01 Best model weights saved. Episode: 19, Total Reward: -659.48, Epsilon: 0.01 Best model weights saved. Episode: 20, Total Reward: -754.21, Epsilon: 0.01 Episode: 21, Total Reward: -637.48, Epsilon: 0.01 Best model weights saved. Episode: 22, Total Reward: -630.92, Epsilon: 0.01 Best model weights saved. Episode: 23, Total Reward: -557.26, Epsilon: 0.01 Best model weights saved. Episode: 24, Total Reward: -904.07, Epsilon: 0.01 Episode: 25, Total Reward: -861.51, Epsilon: 0.01 Episode: 1, Total Reward: -1443.30, Epsilon: 0.03 Best model weights saved. Episode: 2, Total Reward: -1295.84, Epsilon: 0.01 Best model weights saved. Episode: 3, Total Reward: -1473.85, Epsilon: 0.01 Episode: 4, Total Reward: -1283.95, Epsilon: 0.01 Best model weights saved. Episode: 5, Total Reward: -1025.95, Epsilon: 0.01 Best model weights saved. 
Episode: 6, Total Reward: -1039.43, Epsilon: 0.01 Episode: 7, Total Reward: -772.12, Epsilon: 0.01 Best model weights saved. Episode: 8, Total Reward: -717.16, Epsilon: 0.01 Best model weights saved. Episode: 9, Total Reward: -844.21, Epsilon: 0.01 Episode: 10, Total Reward: -387.20, Epsilon: 0.01 Best model weights saved. Episode: 11, Total Reward: -4.70, Epsilon: 0.01 Best model weights saved. Episode: 12, Total Reward: -234.32, Epsilon: 0.01 Episode: 13, Total Reward: -330.43, Epsilon: 0.01 Episode: 14, Total Reward: -389.30, Epsilon: 0.01 Episode: 15, Total Reward: -122.84, Epsilon: 0.01 Episode: 16, Total Reward: -227.96, Epsilon: 0.01 Episode: 17, Total Reward: -3.59, Epsilon: 0.01 Best model weights saved. Episode: 18, Total Reward: -128.41, Epsilon: 0.01 Episode: 19, Total Reward: -376.47, Epsilon: 0.01 Episode: 20, Total Reward: -122.69, Epsilon: 0.01 Episode: 21, Total Reward: -3.02, Epsilon: 0.01 Best model weights saved. Episode: 22, Total Reward: -124.10, Epsilon: 0.01 Episode: 23, Total Reward: -123.83, Epsilon: 0.01 Episode: 24, Total Reward: -118.63, Epsilon: 0.01 Episode: 25, Total Reward: -227.36, Epsilon: 0.01 Episode: 1, Total Reward: -1134.42, Epsilon: 0.03 Best model weights saved. Episode: 2, Total Reward: -1593.29, Epsilon: 0.01 Episode: 3, Total Reward: -1444.16, Epsilon: 0.01 Episode: 4, Total Reward: -1467.13, Epsilon: 0.01 Episode: 5, Total Reward: -1415.80, Epsilon: 0.01 Episode: 6, Total Reward: -1071.08, Epsilon: 0.01 Best model weights saved. Episode: 7, Total Reward: -1241.43, Epsilon: 0.01 Episode: 8, Total Reward: -1190.03, Epsilon: 0.01 Episode: 9, Total Reward: -907.60, Epsilon: 0.01 Best model weights saved. Episode: 10, Total Reward: -882.72, Epsilon: 0.01 Best model weights saved. Episode: 11, Total Reward: -501.11, Epsilon: 0.01 Best model weights saved. Episode: 12, Total Reward: -385.63, Epsilon: 0.01 Best model weights saved. Episode: 13, Total Reward: -2.29, Epsilon: 0.01 Best model weights saved. 
Episode: 14, Total Reward: -127.95, Epsilon: 0.01 Episode: 15, Total Reward: -507.61, Epsilon: 0.01 Episode: 16, Total Reward: -371.57, Epsilon: 0.01 Episode: 17, Total Reward: -727.71, Epsilon: 0.01 Episode: 18, Total Reward: -125.06, Epsilon: 0.01 Episode: 19, Total Reward: -129.09, Epsilon: 0.01 Episode: 20, Total Reward: -131.23, Epsilon: 0.01 Episode: 21, Total Reward: -255.52, Epsilon: 0.01 Episode: 22, Total Reward: -240.91, Epsilon: 0.01 Episode: 23, Total Reward: -354.77, Epsilon: 0.01 Episode: 24, Total Reward: -237.09, Epsilon: 0.01 Episode: 25, Total Reward: -489.19, Epsilon: 0.01
plot_rewards_subplots(rewards_lr, lr_values, 'Learning Rate')
compute_moving_average_and_plot_subplots(rewards_lr, lr_values, 'Learning Rate')
GIF of highest reward attempt¶
Learning Rate = 0.001¶
Learning Rate = 0.01¶
Learning Rate = 0.05¶
Results¶
The DQN model with a learning rate of 0.001 is unable to balance the pendulum, as indicated by consistently low rewards, never exceeding -500. As shown in the GIF, when the pendulum reaches the upright position, it quickly falls back down because the DQN fails to maintain its balance. This suggests that the model has not learned the optimal policy for stabilizing the pendulum, likely due to insufficient learning progress at this learning rate.
The DQN models with learning rates of 0.01 and 0.05 are able to balance the pendulum. The GIFs show that the DQNs can maintain the pendulum in an upright and steady position. This demonstrates that these models have successfully learned the optimal policy for balancing the pendulum, indicating effective training and model convergence.
In terms of stability, the model with a learning rate of 0.01 is the most stable, as evidenced by fewer violent spikes in the reward and moving average curves compared to the model with a learning rate of 0.05. The latter model experiences a sharp decrease in rewards around episodes 12 to 15.
We will use a learning rate of 0.01 for subsequent models.
Best Model for DQN ¶
What we found out from the parameter tuning¶
Our goal is to stabilize the training of our DQN¶
We have used an exploration vs. exploitation strategy¶
This approach encourages the agent to explore various actions, thus gathering diverse experiences during the early stages of training. As the agent learns more about the environment, the epsilon value decreases, allowing the agent to exploit the learned policy more effectively. This balance prevents the agent from getting stuck in suboptimal policies and helps in stabilizing learning.
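The decay schedule described above can be sketched as a small helper. This is a minimal sketch, not our training code; the constants mirror the Agent constructor defaults (epsilon_decay=0.98, epsilon_min=0.01), and epsilon is decayed once per learning step.

```python
# Sketch: the epsilon-greedy decay schedule our Agent follows.
# epsilon_decay=0.98 and epsilon_min=0.01 are the Agent's default values.

def decayed_epsilon(step, start=1.0, decay=0.98, floor=0.01):
    """Epsilon after `step` learning updates: floor-clipped exponential decay."""
    return max(floor, start * decay ** step)

# Exploration fades quickly: roughly 13% after 100 steps,
# and pinned at the 0.01 floor by about 230 steps.
schedule = [decayed_epsilon(s) for s in (0, 50, 100, 250)]
```

Because `learn()` runs once per environment step (200 steps per Pendulum episode), epsilon effectively reaches its floor within the first couple of episodes, which matches the printed logs above.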
We have used soft updates instead of hard updates¶
We opted for soft updates to stabilize the training by gradually updating the target network's weights based on the main network's weights. This reduces the likelihood of drastic changes in the Q-values, which can occur with hard updates, and ensures a more stable convergence by preventing large oscillations or divergence in the learning process.
We have decreased the number of episodes¶
By reducing the number of episodes, we focus on the quality of learning rather than quantity. This decision was based on observing that the agent's performance reached a strong level and fluctuated around an acceptable reward range within the set episodes. This approach prevents unnecessary training and potential overfitting, thereby stabilizing the training process.
We have found the optimal value of tau to be 0.01¶
The optimal tau value of 0.01 for soft updates ensures a stable and controlled update to the target network. A low tau value helps in smoothing out the updates, thereby avoiding abrupt changes that could destabilize the learning process.
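The soft update described here is the same blend our `update_target_model` performs; the sketch below reproduces it on plain lists of toy weights to show how gradually tau=0.01 moves the target network.

```python
# Sketch: soft (Polyak) target updates on toy weight lists.
# tau=0.01 matches the value we settled on; the weights are illustrative.

def soft_update(target, main, tau=0.01):
    """Blend each target weight a small step toward the main network's weight."""
    return [tau * m + (1 - tau) * t for m, t in zip(main, target)]

target_w, main_w = [0.0, 0.0], [1.0, -1.0]
for _ in range(100):              # 100 soft updates
    target_w = soft_update(target_w, main_w)
# After 100 steps the target has moved only ~63% of the way
# (1 - 0.99**100 ≈ 0.634); a hard update would jump the full distance at once.
```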
We have increased the number of dense neurons¶
Increasing the number of dense neurons in the network enhances the model's capacity to approximate the complex Q-function accurately. This increase allows the model to better capture the underlying patterns in the state-action space, leading to more stable learning and improved decision-making capabilities.
We have found the optimal number of actions to be 5¶
By optimizing the number of discrete actions to 5, we found a balance that provides sufficient granularity for the agent's decisions while maintaining computational efficiency.
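The discretization is the same mapping our `Agent.action_scope` builds: spread the action indices evenly over Pendulum's torque range [-2.0, 2.0]. A minimal sketch:

```python
# Sketch: map discrete action indices onto the continuous Pendulum torque
# range [-2.0, 2.0] (mirrors the Agent.action_scope formula).

def torque_bins(num_actions=5, low=-2.0, high=2.0):
    span = high - low
    return [i * span / (num_actions - 1) + low for i in range(num_actions)]

print(torque_bins())      # [-2.0, -1.0, 0.0, 1.0, 2.0]
```

With 5 actions the agent gets full-strength torque in both directions, half-strength corrections, and a zero-torque option, which proved granular enough for balancing.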
We have found the optimal value for gamma to be 0.98¶
Setting the gamma (discount factor) to 0.98 ensures that the agent appropriately values future rewards, striking a balance between immediate and future benefits. This balance is crucial for stable training, as it prevents the agent from becoming shortsighted (too low gamma) or overly optimistic (too high gamma) about future rewards, leading to more consistent and stable learning outcomes.
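A quick worked example of what gamma = 0.98 implies: a constant per-step reward r is worth at most r / (1 - gamma) in discounted terms, i.e. an effective horizon of roughly 1 / (1 - 0.98) = 50 steps. The helper below is illustrative, not part of our training code.

```python
# Sketch: the effect of gamma=0.98 on a Pendulum-like stream of -1 rewards.

def discounted_return(rewards, gamma=0.98):
    """Compute sum_t gamma^t * r_t by folding from the last reward backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# 200 steps of reward -1: close to the infinite-horizon bound of
# -1 / (1 - 0.98) = -50, so rewards ~50 steps ahead still matter.
g = discounted_return([-1.0] * 200)
```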
We have found the optimal learning rate to be 0.01¶
The learning rate of 0.01 was identified as optimal for our DQN model. This learning rate provides a good balance between learning speed and stability. A learning rate that is too high can cause the model to overshoot the optimal parameters, leading to instability and poor convergence. Conversely, a learning rate that is too low can make the training process excessively slow and potentially result in suboptimal solutions due to insufficient exploration. The value of 0.01 helps the model to make steady progress towards finding an optimal policy while maintaining stability in training, thus enhancing both convergence and effectiveness.
class Memory:
# Constructor, the capacity is the maximum size of the memory. Once capacity is reached, the old memories are removed.
def __init__(self, capacity):
self.memory = deque(maxlen=capacity)
'''
Adds a transition to the memory. The transition is a tuple of (state, action, reward, next_state, done).
If the maximum capacity of the memory is reached, the oldest memories are overwritten.
'''
def update(self, transition):
self.memory.append(transition)
'''
Retrieve a random sample from memory, batch size indicates the number of samples to be retrieved.
'''
def sample(self, batch_size):
return random.sample(self.memory, batch_size)
class Net(tf.keras.Model):
# constructor initializes the layers
def __init__(self, input_size, num_actions):
super(Net, self).__init__()
self.dense1 = Dense(128, activation='relu', input_shape=(input_size,))
self.dense2 = Dense(128, activation='relu')
self.output_layer = Dense(num_actions, activation='linear')
'''
This function is the forward pass of the model. It takes the state as input and returns the Q values for each action.
'x' is the input to the model, which is the state from the environment. 'x' is passed through the 2 dense
layers (ReLU activation) and then the output layer, whose linear activation lets the network predict an
unbounded range of values, as required for estimating action values in the RL model.
'''
def call(self, x):
# first dense layer
x = self.dense1(x)
# second dense layer
x = self.dense2(x)
return self.output_layer(x)
class Agent:
# this constructor initializes the environment, model, memory, and other variables required for the agent
def __init__(self, env_string, num_actions, state_size, batch_size=32, learning_rate=0.01, gamma=0.98, epsilon=1.0, epsilon_decay=0.98, epsilon_min=0.01, tau=0.01, memory_capacity=10000):
self.env_string = env_string
self.env = gym.make(env_string)
self.env.reset()
self.state_size = state_size
self.num_actions = num_actions
self.action_scope = [i * 4.0 / (num_actions - 1) - 2.0 for i in range(num_actions)] # Adjusted to match action scaling in dueling DQN
self.batch_size = batch_size
self.gamma = gamma
self.epsilon = epsilon
self.epsilon_decay = epsilon_decay
self.epsilon_min = epsilon_min
self.tau = tau
self.memory = Memory(memory_capacity)
self.best_total_reward = float('-inf')
self.model = Net(state_size, num_actions)
self.target_model = Net(state_size, num_actions)
self.optimizer = Adam(learning_rate)
self.loss_fn = tf.losses.MeanSquaredError()
'''
Implements the epsilon-greedy policy. With probability epsilon, a random action is selected, otherwise the action chosen will
be the one with the highest Q value. The epsilon value is decayed over time to reduce the exploration as the agent learns.
It also converts the state to a tensor before passing through to the model.
'''
def select_action(self, state):
# if random number is less than epsilon, return random action
if np.random.rand() < self.epsilon:
return np.random.randint(self.num_actions), None
# convert to tensor
state = tf.convert_to_tensor([state], dtype=tf.float32)
q_values = self.model(state)
# else, return action with highest Q value
return np.argmax(q_values.numpy()), None
# Store an experience tuple in the memory buffer (see the Memory class)
def store_transition(self, state, action, reward, next_state):
self.memory.update((state, action, reward, next_state, False))
'''
Learn function performs single step training on a batch of experiences sampled from the memory buffer.
'''
def learn(self):
if len(self.memory.memory) < self.batch_size:
return
# sample a batch of transitions from the memory
transitions = self.memory.sample(self.batch_size)
# extract the states, actions, rewards, next states, from the batch
state_batch, action_batch, reward_batch, next_state_batch, _ = zip(*transitions)
# convert the s, a, r, s_' to tensors
state_batch = tf.convert_to_tensor(state_batch, dtype=tf.float32)
action_batch = tf.convert_to_tensor(action_batch, dtype=tf.int32)
reward_batch = tf.convert_to_tensor(reward_batch, dtype=tf.float32)
next_state_batch = tf.convert_to_tensor(next_state_batch, dtype=tf.float32)
'''
Tensorflow GradientTape is an API for automatic differentiation. It records operations for automatic differentiation.
This function will calculate the predicted Q-values from the current state and actions.
'''
with tf.GradientTape() as tape:
'''
Forward pass through the DQN model to get the Q-values for all actions given the current batch of states.
q-values contains the predicted q-values for each action in each state of the batch
'''
q_values = self.model(state_batch)
'''
'action_indices' calculates the indices of the actions taken in the Q-value matrix.
Since q_values contains Q-values for all actions, this step is necessary to select only the Q-values corresponding to the actions that were actually taken.
'''
action_indices = tf.range(self.batch_size) * self.num_actions + action_batch
'''
Reshapes the Q-value matrix to a single vector, then selects the Q-values for the actions taken using the 'action_indices' calculated above.
'''
predicted_q = tf.gather(tf.reshape(q_values, [-1]), action_indices)
'''
Forward pass through the target DQN model to get Q-values for all actions given the next states.
'''
next_q_values = self.target_model(next_state_batch)
# Finds the maximum Q-value among all actions for each next state, which represents the best possible future reward achievable from the next state.
max_next_q = tf.reduce_max(next_q_values, axis=1)
'''
Calculates the target Q-values using the immediate reward received (reward_batch) and the
discounted maximum future reward (self.gamma * max_next_q). This forms the update target for the Q-value of the action taken.
'''
target_q = reward_batch + self.gamma * max_next_q
# Calculates the loss between the predicted Q-values and the target Q-values, basically MSE loss
loss = self.loss_fn(target_q, predicted_q)
# Calculate the gradients of the loss with respect to the model parameters
grads = tape.gradient(loss, self.model.trainable_variables)
self.optimizer.apply_gradients(zip(grads, self.model.trainable_variables))
self.update_epsilon()
# update the epsilon value
def update_epsilon(self):
self.epsilon = max(self.epsilon_min, self.epsilon_decay * self.epsilon)
# soft-update the target network towards the main network
def update_target_model(self):
main_weights = self.model.get_weights()
target_weights = self.target_model.get_weights()
for i in range(len(target_weights)):
target_weights[i] = self.tau * main_weights[i] + (1 - self.tau) * target_weights[i]
self.target_model.set_weights(target_weights)
# checks if current model is the best model based on total reward
def is_best_model(self, total_reward):
return total_reward > self.best_total_reward
# updates the best model with the current model
def update_best_model(self, total_reward):
self.best_total_reward = total_reward
# Main function to run the training
def main():
env_string = 'Pendulum-v0'
num_actions = 5
state_size = gym.make(env_string).observation_space.shape[0]
agent = Agent(env_string, num_actions, state_size)
episodes = 25
rewards = [] # Initialize rewards list to store total rewards per episode
for ep in range(episodes):
state = agent.env.reset()
done = False
total_reward = 0
# Capture frames for GIF
frames = []
while not done:
# Render the environment and capture frames
frames.append(agent.env.render(mode='rgb_array'))
action_index, _ = agent.select_action(state)
action = [agent.action_scope[action_index]]
next_state, reward, done, _ = agent.env.step(action)
agent.store_transition(state, action_index, reward, next_state)
state = next_state
total_reward += reward
agent.learn()
agent.update_target_model()
rewards.append(total_reward) # Append the total reward of the episode to the rewards list
print(f"Episode: {ep+1}, Total Reward: {total_reward:.2f}, Epsilon: {agent.epsilon:.2f}")
if agent.is_best_model(total_reward):
# Update and save the best model weights
agent.update_best_model(total_reward)
model_dir = 'Weights/DQN/Best'
save_model_weights(agent, model_dir)
print("Best model weights saved.")
# Directory where you want to save the files
save_dir = 'training_animations/DQN/Best'
# Create the directory if it doesn't exist
if not os.path.exists(save_dir):
os.makedirs(save_dir)
# Save frames as GIF
gif_filename = os.path.join(save_dir, f'episode_{ep+1}.gif')
imageio.mimsave(gif_filename, frames, duration=0.00333, loop=0) # Adjust duration as needed
agent.env.close()
return rewards
rewards = main()
Episode: 1, Total Reward: -1607.93, Epsilon: 0.03 Best model weights saved. Episode: 2, Total Reward: -1732.28, Epsilon: 0.01 Episode: 3, Total Reward: -1549.14, Epsilon: 0.01 Best model weights saved. Episode: 4, Total Reward: -998.83, Epsilon: 0.01 Best model weights saved. Episode: 5, Total Reward: -1071.18, Epsilon: 0.01 Episode: 6, Total Reward: -1247.82, Epsilon: 0.01 Episode: 7, Total Reward: -1227.71, Epsilon: 0.01 Episode: 8, Total Reward: -778.28, Epsilon: 0.01 Best model weights saved. Episode: 9, Total Reward: -619.46, Epsilon: 0.01 Best model weights saved. Episode: 10, Total Reward: -589.84, Epsilon: 0.01 Best model weights saved. Episode: 11, Total Reward: -366.38, Epsilon: 0.01 Best model weights saved. Episode: 12, Total Reward: -259.32, Epsilon: 0.01 Best model weights saved. Episode: 13, Total Reward: -4.25, Epsilon: 0.01 Best model weights saved. Episode: 14, Total Reward: -4.13, Epsilon: 0.01 Best model weights saved. Episode: 15, Total Reward: -369.27, Epsilon: 0.01 Episode: 16, Total Reward: -364.99, Epsilon: 0.01 Episode: 17, Total Reward: -122.58, Epsilon: 0.01 Episode: 18, Total Reward: -127.66, Epsilon: 0.01 Episode: 19, Total Reward: -128.92, Epsilon: 0.01 Episode: 20, Total Reward: -1.96, Epsilon: 0.01 Best model weights saved. Episode: 21, Total Reward: -117.32, Epsilon: 0.01 Episode: 22, Total Reward: -237.20, Epsilon: 0.01 Episode: 23, Total Reward: -392.66, Epsilon: 0.01 Episode: 24, Total Reward: -2.80, Epsilon: 0.01 Episode: 25, Total Reward: -126.18, Epsilon: 0.01
plot_rewards(rewards)
compute_moving_average_and_plot(rewards)
GIF of highest reward attempt¶
Results¶
We can see that the best DQN model is able to balance the pendulum effectively. As shown in the GIF, when the pendulum is in the upright position, the DQN model can apply torque (discretized action) to maintain the pendulum in this stable state. This suggests that the DQN has successfully learned the optimal policy for keeping the pendulum balanced, demonstrating effective training and convergence.
Additionally, in terms of stability during training, the model exhibits good performance. As seen from the rewards graph, despite an initial spike around the third episode, the rest of the training episodes show very few spikes, and those that do occur are not violent. This indicates a stable learning process. The moving average curve further confirms this stability, as it appears very smooth, suggesting consistent and steady progress throughout the training.
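The moving-average smoothing behind these curves can be sketched as follows. `compute_moving_average_and_plot` is our plotting helper (not shown here); the trailing-window implementation and the window size below are assumptions for illustration.

```python
# Sketch: trailing moving average used to smooth per-episode reward curves.
# The window size of 3 is an illustrative assumption.

def moving_average(values, window=3):
    """Mean over up to `window` most recent values at each position."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1) : i + 1]
        out.append(sum(chunk) / len(chunk))
    return out

# A reward trajectory shaped like our training runs: large negative early,
# approaching 0 as the policy improves.
smoothed = moving_average([-1600, -1000, -600, -200, -100, -5])
```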
In terms of rewards, the model achieves near-optimal episode rewards of around -2 to -4, very close to the best possible reward of 0. This low negative value indicates that the model has minimized the cost associated with balancing the pendulum, reflecting highly successful learning and application of the learned policy. The ability to achieve such a near-optimal reward underscores the model's efficiency and effectiveness in solving the task.
In terms of learning speed, the model first reaches these near-optimal rewards within roughly 13 episodes.
Overall, the model demonstrates excellent training stability while simultaneously reaching the optimal policy at a relatively fast pace. This balance between stability and efficiency highlights the model's effectiveness in learning and executing the desired task.
Running the Best DQN Weights Multiple Times (Test the Best DQN Model) ¶
What we did¶
During the training process, we saved the weights of the best-performing model (highest total reward). We then loaded these weights into a new model with the same architecture and ran the model in the Pendulum-v0 environment multiple times.
Why do we test the best DQN Model (Run the model multiple times)¶
It allows us to evaluate the effectiveness of the trained model. By running the best model multiple times in the environment, we can assess how well it performs on the task, including its ability to maximize rewards.
Testing helps ensure that the model has learned a generalizable policy rather than overfitting to the specific experiences encountered during training. This is important for confirming that the model can handle a variety of situations in the environment.
Testing helps ensure that the high rewards achieved by the model are not due to chance. By running the model multiple times, we can confirm that the observed performance is robust and not a result of random variations or anomalies in a single test run.
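Aggregating the repeated test runs into summary statistics makes this robustness check concrete. The sketch below is illustrative: the `summarize` helper, the sample rewards, and the -400 "balanced episode" threshold are all assumptions, not part of our test code.

```python
import statistics

# Sketch: summarizing repeated test episodes into robustness statistics.
# The -400 threshold for calling an episode "balanced" is an assumption.

def summarize(rewards, balanced_above=-400.0):
    return {
        "episodes": len(rewards),
        "mean": statistics.mean(rewards),
        "stdev": statistics.pstdev(rewards),
        "best": max(rewards),
        "balanced_rate": sum(r > balanced_above for r in rewards) / len(rewards),
    }

# Toy rewards shaped like the test log above:
stats = summarize([-125.8, -1.0, -495.6, -360.4, -2.0, -117.3])
```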
class Net(tf.keras.Model):
def __init__(self, input_size, num_actions):
super(Net, self).__init__()
self.dense1 = tf.keras.layers.Dense(128, activation='relu', input_shape=(input_size,))
self.dense2 = tf.keras.layers.Dense(128, activation='relu')
self.output_layer = tf.keras.layers.Dense(num_actions, activation='linear')
def call(self, inputs):
x = self.dense1(inputs)
x = self.dense2(x)
return self.output_layer(x)
def load_model(model_path, input_size, num_actions):
model = Net(input_size, num_actions)
# Create a dummy input and call the model to initialize variables
dummy_input = tf.zeros((1, input_size))
model(dummy_input)
# Now load the weights
model.load_weights(model_path)
return model
def test_model(env_string, model, dir_name, num_episodes=5):
env = gym.make(env_string)
rewards = []
for episode in range(num_episodes):
# Save the Frames
frames = []
state = env.reset()
done = False
total_reward = 0
while not done:
state = tf.convert_to_tensor([state], dtype=tf.float32)
action_values = model(state)
action = np.argmax(action_values.numpy()[0])
# Convert discrete action to continuous action for Pendulum
action = [action * 4.0 / (model.output_layer.units - 1) - 2.0]
next_state, reward, done, _ = env.step(action)
# Render the environment and capture frames
frames.append(env.render(mode='rgb_array'))
state = next_state
total_reward += reward
# Directory where you want to save the files
save_dir = f'test_animations/{dir_name}'
# Create the directory if it doesn't exist
if not os.path.exists(save_dir):
os.makedirs(save_dir)
# Save frames as GIF
gif_filename = os.path.join(save_dir, f'episode_{episode+1}.gif')
imageio.mimsave(gif_filename, frames, duration=0.00333, loop=0)
print(f"Episode: {episode+1}, Total Reward: {total_reward:.2f}")
rewards.append(total_reward)
env.close()
return rewards
if __name__ == '__main__':
env_string = 'Pendulum-v0'
input_size = gym.make(env_string).observation_space.shape[0]
num_actions = 5 # Must match the number used during training
model_dir = 'Weights/DQN/Best'
weights_path = os.path.join(model_dir, 'tensorflow_dqn_weights.h5')
# Ensure the model path exists
if os.path.exists(weights_path):
model = load_model(weights_path, input_size, num_actions)
rewards = test_model(env_string, model, 'DQN/Best', 20)
else:
print(f"Model weights not found in {weights_path}. Please ensure the correct path.")
Episode: 1, Total Reward: -125.75 Episode: 2, Total Reward: -125.99 Episode: 3, Total Reward: -116.36 Episode: 4, Total Reward: -119.58 Episode: 5, Total Reward: -124.95 Episode: 6, Total Reward: -495.55 Episode: 7, Total Reward: -1.04 Episode: 8, Total Reward: -122.47 Episode: 9, Total Reward: -233.94 Episode: 10, Total Reward: -122.37 Episode: 11, Total Reward: -117.13 Episode: 12, Total Reward: -1.57 Episode: 13, Total Reward: -123.51 Episode: 14, Total Reward: -118.90 Episode: 15, Total Reward: -360.44 Episode: 16, Total Reward: -117.50 Episode: 17, Total Reward: -2.00 Episode: 18, Total Reward: -373.40 Episode: 19, Total Reward: -128.37 Episode: 20, Total Reward: -117.28
plot_rewards(rewards)
display_gifs_in_grid('test_animations/DQN/Best')
Results¶
We ran a total of 20 episodes using the best DQN model's weights and found that the model was able to balance the pendulum 100% of the time. The rewards ranged from roughly -1 to -496. The variation in rewards can be attributed to the fact that the pendulum starts from random positions. This randomness in starting positions contributes to the difference in rewards, as the model has to adapt to different initial conditions to achieve balance. Despite these variations, the model consistently demonstrated its ability to maintain the pendulum's balance, showcasing its robustness and effectiveness in learning the optimal policy.
Double DQN ¶
What is Double DQN¶
Double DQN (Double Deep Q-Network) is an enhancement to the standard DQN algorithm designed to address the issue of overestimation bias in Q-value estimates. This bias, present in standard Q-learning and DQN algorithms, can lead to overly optimistic Q-value estimates and suboptimal policies.
Double DQN tackles this problem by decoupling action selection and action evaluation using two separate networks: an online network for selecting the best action and a target network for evaluating the Q-value of that action.
The algorithm employs two neural networks - a primary network and a target network - with the primary network used for action selection and the target network for computing target Q-values. This separation helps reduce the correlation between action selection and action evaluation, leading to more accurate Q-value estimates.
What is Overestimation Bias¶
Overestimation bias is where the estimated value of a certain action or state-action pair is consistently higher than its true value. This occurs because of the way the maximum operation in Q-learning selects and reinforces the highest estimated Q-value, even if that estimate is overly optimistic.
In traditional Q-learning or Single DQN algorithms, the action that maximizes the Q-value for the next state is both selected and evaluated using the same Q-network. If the Q-network has even a slight tendency to overestimate the value of actions, this overestimation can be reinforced during training. Over time, this bias accumulates, leading to an inflated and inaccurate estimate of the expected return for certain actions.
Overestimation bias can negatively impact the learning process, causing the agent to favor suboptimal actions because they appear more valuable than they actually are.
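The effect is easy to reproduce with a small simulation (hypothetical numbers, not taken from our experiments): when the true value of every action is zero, taking the max over noisy estimates yields a systematically positive value, while selecting the action with one noisy estimate and evaluating it with an independent one does not.

```python
import numpy as np

rng = np.random.default_rng(0)

true_q = np.zeros(5)      # all 5 actions are equally worthless: true value 0
n_trials = 10_000

# Noisy but unbiased Q estimates: zero-mean noise per action
noisy_q = rng.normal(loc=true_q, scale=1.0, size=(n_trials, 5))

# Single-estimator target (DQN-style): max over the same noisy estimates
single_max = noisy_q.max(axis=1).mean()

# Double-estimator target (Double-DQN-style): one noisy copy selects the
# action, an independent noisy copy evaluates it
noisy_q2 = rng.normal(loc=true_q, scale=1.0, size=(n_trials, 5))
sel = noisy_q.argmax(axis=1)
double_est = noisy_q2[np.arange(n_trials), sel].mean()

print(f"true max Q:       0.0")
print(f"single-max value: {single_max:.3f}")   # systematically above 0
print(f"double estimate:  {double_est:.3f}")   # close to 0
```

Even though every individual estimate is unbiased, the max operator alone produces a positive bias; decoupling selection from evaluation removes it.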
Difference between Double DQN and DQN¶
Single DQN uses a single neural network to both select and evaluate the best action, which can lead to overestimating the Q-values because it evaluates the maximum Q-value using the same network. In contrast, Double DQN mitigates this bias by using two separate networks: an online network to select actions and a target network to evaluate the selected actions. This separation helps in providing more accurate Q-value estimates, resulting in improved stability and performance during training.
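The difference comes down to one line in the target computation. A minimal sketch with hypothetical Q-value tables (not taken from our model) comparing the two targets:

```python
import numpy as np

gamma = 0.98  # same discount as the Agent below

# Hypothetical Q-values for a batch of two next states, three actions
q_online = np.array([[1.5, 0.2, 0.7],    # online network's estimates
                     [0.1, 0.9, 0.4]])
q_target = np.array([[0.3, 1.0, 0.6],    # target network's estimates
                     [0.8, 0.2, 0.5]])
rewards  = np.array([-1.0, -0.5])

# Standard DQN: the target network both selects and evaluates
dqn_target = rewards + gamma * q_target.max(axis=1)

# Double DQN: the online network selects, the target network evaluates
sel = q_online.argmax(axis=1)                       # action selection
double_target = rewards + gamma * q_target[np.arange(2), sel]

print(dqn_target)      # [-0.02   0.284]
print(double_target)   # [-0.706 -0.304]
```

When the two networks disagree about the best action, the Double DQN target is lower, which is exactly the mechanism that counteracts the max operator's optimism.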
Base Model ¶
Memory Class¶
The Memory class stores experience tuples (state, action, reward, next_state) and supports sampling for training. This design allows the agent to store and retrieve past experiences efficiently, facilitating experience replay which is crucial for stabilizing the training process in reinforcement learning.
Neural Network Architecture¶
The Net class defines the neural network architecture used for approximating Q-values. It features two hidden dense layers with ReLU activation functions and a final output layer with a linear activation function. This network is responsible for predicting Q-values for each action given a state, serving as the core component for the agent's decision-making process.
Agent Class and Learning Mechanism¶
The Agent class integrates the Memory and Net classes, implementing the core functionalities of Double DQN. It employs an epsilon-greedy policy for action selection, stores transitions in memory, and performs learning through experience replay. The learn method updates the model by minimizing the mean squared error between predicted and target Q-values, with the target values derived from the target network. The update_target_model method ensures that the target network’s weights are updated slowly towards the main network’s weights, thereby reducing the risk of divergence.
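The soft update in update_target_model can be sketched in isolation with toy weight vectors (not taken from the trained model):

```python
import numpy as np

tau = 0.01  # same default as the Agent below

main_w   = np.array([1.0, -2.0])   # hypothetical main-network weights
target_w = np.array([0.0,  0.0])   # hypothetical target-network weights

# Polyak/soft update: target <- tau*main + (1-tau)*target
target_w = tau * main_w + (1 - tau) * target_w
print(target_w)  # [ 0.01 -0.02]
```

With a small tau, the target network drifts toward the main network by only 1% per call, which is what keeps the bootstrap targets slowly moving and the training stable.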
Double DQN Implementation¶
This code represents a base model for Double DQN due to its key features: it uses a target network to stabilize training, and it incorporates Double DQN’s key innovation—using the main network to select actions and the target network to evaluate them, thus mitigating overestimation bias.
class Memory:
# Constructor, the capacity is the maximum size of the memory. Once capacity is reached, the old memories are removed.
def __init__(self, capacity):
self.memory = deque(maxlen=capacity)
'''
Adds a transition to the memory. The transition is a tuple of (state, action, reward, next_state, done).
If the maximum capacity of the memory is reached, the oldest memories are overwritten.
'''
def update(self, transition):
self.memory.append(transition)
'''
Retrieve a random sample from memory, batch size indicates the number of samples to be retrieved.
'''
def sample(self, batch_size):
return random.sample(self.memory, batch_size)
class Net(tf.keras.Model):
# constructor initializes the layers
def __init__(self, input_size, num_actions):
super(Net, self).__init__()
self.dense1 = Dense(128, activation='relu', input_shape=(input_size,))
self.dense2 = Dense(128, activation='relu')
self.output_layer = Dense(num_actions, activation='linear')
'''
This function is the forward pass of the model. It takes the state as input and returns the Q values for each action.
'x' is the input to the model, which is the state from the environment. 'x' is passed through the two dense
layers (ReLU activation) and the output layer to obtain the predicted Q-value for each action. The output layer
uses a linear activation so the network can predict an unbounded range of action values.
'''
def call(self, x):
# first dense layer
x = self.dense1(x)
# second dense layer
x = self.dense2(x)
return self.output_layer(x)
class Agent:
# this constructor initializes the environment, model, memory, and other variables required for the agent
def __init__(self, env_string, num_actions, state_size, batch_size=32, learning_rate=0.01, gamma=0.98, epsilon=1.0, epsilon_decay=0.98, epsilon_min=0.01, tau=0.01, memory_capacity=10000):
self.env_string = env_string
self.env = gym.make(env_string)
self.env.reset()
self.state_size = state_size
self.num_actions = num_actions
self.action_scope = [i * 4.0 / (num_actions - 1) - 2.0 for i in range(num_actions)] # Adjusted to match action scaling in dueling DQN
self.batch_size = batch_size
self.gamma = gamma
self.epsilon = epsilon
self.epsilon_decay = epsilon_decay
self.epsilon_min = epsilon_min
self.tau = tau
self.memory = Memory(memory_capacity)
self.best_total_reward = float('-inf')
self.model = Net(state_size, num_actions)
self.target_model = Net(state_size, num_actions)
self.optimizer = Adam(learning_rate)
self.loss_fn = tf.losses.MeanSquaredError()
'''
Implements the epsilon-greedy policy. With probability epsilon, a random action is selected, otherwise the action chosen will
be the one with the highest Q value. The epsilon value is decayed over time to reduce the exploration as the agent learns.
It also converts the state to a tensor before passing through to the model.
'''
def select_action(self, state):
# if random number is less than epsilon, return random action
if np.random.rand() < self.epsilon:
return np.random.randint(self.num_actions), None
# convert to tensor
state = tf.convert_to_tensor([state], dtype=tf.float32)
q_values = self.model(state)
# else, return action with highest Q value
return np.argmax(q_values.numpy()), None
# Store experience tuple into the memory buffer, function is above
def store_transition(self, state, action, reward, next_state):
self.memory.update((state, action, reward, next_state, False))
'''
Learn function performs single step training on a batch of experiences sampled from the memory buffer.
'''
def learn(self):
if len(self.memory.memory) < self.batch_size:
return
# sample a batch of transitions from the memory
transitions = self.memory.sample(self.batch_size)
# extract the states, actions, rewards, next states, from the batch
state_batch, action_batch, reward_batch, next_state_batch, _ = zip(*transitions)
# convert the s, a, r, s_' to tensors
state_batch = tf.convert_to_tensor(state_batch, dtype=tf.float32)
action_batch = tf.convert_to_tensor(action_batch, dtype=tf.int32)
reward_batch = tf.convert_to_tensor(reward_batch, dtype=tf.float32)
next_state_batch = tf.convert_to_tensor(next_state_batch, dtype=tf.float32)
'''
Tensorflow GradientTape is an API for automatic differentiation. It records operations for automatic differentiation.
This function will calculate the predicted Q-values from the current state and actions.
'''
with tf.GradientTape() as tape:
'''
Forward pass through the DQN model to get the Q-values for all actions given the current batch of states.
q-values contains the predicted q-values for each action in each state of the batch
'''
q_values = self.model(state_batch)
'''
'action_indices' calculates the indices of the actions taken in the Q-value matrix.
Since q_values contains Q-values for all actions, this step is necessary to select only the Q-values corresponding to the actions that were actually taken.
'''
action_indices = tf.range(self.batch_size, dtype=tf.int32) * self.num_actions + action_batch
'''
Reshapes the Q-value matrix to a single vector, then selects the Q-values for the actions taken using the 'action_indices' calculated above.
'''
predicted_q = tf.gather(tf.reshape(q_values, [-1]), action_indices)
'''
Forward pass through the target DQN model to get Q-values for all actions given the next states.
'''
next_q_values = self.target_model(next_state_batch)
# Use the main network to select actions for the next state
next_action_indices = tf.argmax(self.model(next_state_batch), axis=1, output_type=tf.int32)
# Gather the Q-values for the actions taken by the next state from the target network
next_q_values_target = tf.gather(tf.reshape(next_q_values, [-1]),
tf.range(self.batch_size, dtype=tf.int32) * self.num_actions + next_action_indices)
'''
Calculates the target Q-values from the immediate reward (reward_batch) plus the discounted Q-value of the
selected next action (self.gamma * next_q_values_target). This forms the update target for the Q-value of the action taken.
'''
target_q = reward_batch + self.gamma * next_q_values_target
# Calculates the loss between the predicted Q-values and the target Q-values, basically MSE loss
loss = self.loss_fn(target_q, predicted_q)
# Calculate the gradients of the loss with respect to the model parameters
grads = tape.gradient(loss, self.model.trainable_variables)
self.optimizer.apply_gradients(zip(grads, self.model.trainable_variables))
self.update_epsilon()
# update the epsilon value
def update_epsilon(self):
self.epsilon = max(self.epsilon_min, self.epsilon_decay * self.epsilon)
# soft-update the target network's weights toward the main network's weights
def update_target_model(self):
main_weights = self.model.get_weights()
target_weights = self.target_model.get_weights()
for i in range(len(target_weights)):
target_weights[i] = self.tau * main_weights[i] + (1 - self.tau) * target_weights[i]
self.target_model.set_weights(target_weights)
# checks if current model is the best model based on total reward
def is_best_model(self, total_reward):
return total_reward > self.best_total_reward
# updates the best model with the current model
def update_best_model(self, total_reward):
self.best_total_reward = total_reward
# Main function to run the training
def main():
env_string = 'Pendulum-v0'
num_actions = 5
state_size = gym.make(env_string).observation_space.shape[0]
agent = Agent(env_string, num_actions, state_size)
episodes = 25
rewards = [] # Initialize rewards list to store total rewards per episode
for ep in range(episodes):
state = agent.env.reset()
done = False
total_reward = 0
# Capture frames for GIF
frames = []
while not done:
# Render the environment and capture frames
frames.append(agent.env.render(mode='rgb_array'))
action_index, _ = agent.select_action(state)
action = [agent.action_scope[action_index]]
next_state, reward, done, _ = agent.env.step(action)
agent.store_transition(state, action_index, reward, next_state)
state = next_state
total_reward += reward
agent.learn()
agent.update_target_model()
rewards.append(total_reward) # Append the total reward of the episode to the rewards list
print(f"Episode: {ep+1}, Total Reward: {total_reward:.2f}, Epsilon: {agent.epsilon:.2f}")
if agent.is_best_model(total_reward):
# Update and save the best model weights
agent.update_best_model(total_reward)
model_dir = 'Weights/DoubleDQN/Base'
save_model_weights(agent, model_dir)
print("Best model weights saved.")
# Directory where you want to save the files
save_dir = 'training_animations/DoubleDQN/Base'
# Create the directory if it doesn't exist
if not os.path.exists(save_dir):
os.makedirs(save_dir)
# Save frames as GIF
gif_filename = os.path.join(save_dir, f'episode_{ep+1}.gif')
imageio.mimsave(gif_filename, frames, duration=0.00333, loop=0) # Adjust duration as needed
agent.env.close()
return rewards
rewards = main()
Episode: 1, Total Reward: -1115.30, Epsilon: 0.03
Best model weights saved.
Episode: 2, Total Reward: -1226.16, Epsilon: 0.01
Episode: 3, Total Reward: -1598.33, Epsilon: 0.01
Episode: 4, Total Reward: -1187.60, Epsilon: 0.01
Episode: 5, Total Reward: -1052.81, Epsilon: 0.01
Best model weights saved.
Episode: 6, Total Reward: -522.12, Epsilon: 0.01
Best model weights saved.
Episode: 7, Total Reward: -723.88, Epsilon: 0.01
Episode: 8, Total Reward: -132.93, Epsilon: 0.01
Best model weights saved.
Episode: 9, Total Reward: -5.55, Epsilon: 0.01
Best model weights saved.
Episode: 10, Total Reward: -504.30, Epsilon: 0.01
Episode: 11, Total Reward: -374.22, Epsilon: 0.01
Episode: 12, Total Reward: -2.50, Epsilon: 0.01
Best model weights saved.
Episode: 13, Total Reward: -129.26, Epsilon: 0.01
Episode: 14, Total Reward: -252.06, Epsilon: 0.01
Episode: 15, Total Reward: -387.01, Epsilon: 0.01
Episode: 16, Total Reward: -306.69, Epsilon: 0.01
Episode: 17, Total Reward: -125.63, Epsilon: 0.01
Episode: 18, Total Reward: -382.77, Epsilon: 0.01
Episode: 19, Total Reward: -2.82, Epsilon: 0.01
Episode: 20, Total Reward: -123.51, Epsilon: 0.01
Episode: 21, Total Reward: -292.76, Epsilon: 0.01
Episode: 22, Total Reward: -124.42, Epsilon: 0.01
Episode: 23, Total Reward: -5.62, Epsilon: 0.01
Episode: 24, Total Reward: -124.10, Epsilon: 0.01
Episode: 25, Total Reward: -129.41, Epsilon: 0.01
plot_rewards(rewards)
compute_moving_average_and_plot(rewards)
Results
The Double DQN model is performing well, demonstrating its capability to effectively manage the pendulum's orientation. By maintaining the pendulum in an upright and steady position, the model showcases the benefits of Double DQN in reducing overestimation bias and stabilizing training.
Additionally, the rewards are encouraging. At its best episode, the total reward is around -2.50. In the Pendulum environment rewards are always non-positive, so a total reward this close to zero indicates that the model kept the pendulum upright for nearly the entire episode, which is effectively optimal behavior for this task.
However, the rewards plot and moving-average plot reveal significant training instability, with sharp spikes across the entire run. This volatility suggests that the Double DQN is experiencing erratic updates, possibly caused by the learning rate, the size of the experience replay buffer, or the balance between exploration and exploitation.
Next, we will explore gradient clipping as a potential remedy for the instability observed during training. Gradient clipping prevents gradients from becoming too large during training, which helps mitigate exploding gradients that contribute to erratic updates and instability.
Gradient Clipping ¶
What is Gradient Clipping¶
Gradient clipping is a technique that constrains the magnitude of the gradients computed during backpropagation, preventing them from becoming excessively large. It is used to improve the stability and performance of training algorithms, particularly those that use neural networks for function approximation, such as policy-gradient methods and value-function approximation.
Why should we use it?¶
Large Gradient Updates: RL algorithms, especially those that use function approximation, can experience large gradients due to the complexity of learning from rewards and states. These large gradients can lead to unstable updates and poor performance.
Exploding Gradients: In RL, gradients can sometimes grow exponentially, especially when training deep networks or when the reward signal is sparse and highly variable. Gradient clipping helps mitigate this issue.
Benefits of Gradient Clipping¶
Controlled Updates: Gradient clipping prevents excessively large updates to the network weights, leading to more stable training. This is crucial in RL where training can be inherently noisy and unstable due to the nature of reward signals and environment interactions.
Steady Learning: By controlling gradient magnitudes, clipping ensures that the learning process remains steady and does not experience sudden jumps, which can help in achieving faster and more reliable convergence to optimal policies or value functions.
Smooth Updates: Gradient clipping helps in reducing oscillations and erratic behavior in training, leading to smoother updates and more consistent learning.
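As a rough sketch of the idea (written in NumPy rather than the TensorFlow ops used in the model below): element-wise clipping, which our implementation applies via tf.clip_by_value, truncates each gradient component independently, while the common global-norm alternative rescales the whole update and preserves its direction. The gradient values here are hypothetical.

```python
import numpy as np

def clip_by_value(grads, lo=-1.0, hi=1.0):
    """Element-wise clipping: truncate each component to [lo, hi]."""
    return [np.clip(g, lo, hi) for g in grads]

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients jointly so their combined L2 norm is at most
    max_norm, preserving the direction of the update."""
    total = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-8))
    return [g * scale for g in grads]

# Hypothetical per-layer gradients with one exploding component
grads = [np.array([3.0, -0.5]), np.array([0.2])]

print(clip_by_value(grads))        # large components truncated to +/-1
print(clip_by_global_norm(grads))  # whole update shrunk, direction kept
```

Per-value clipping can distort the update direction when only some components are large; global-norm clipping avoids that at the cost of shrinking even the small components.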
class Memory:
# Constructor, the capacity is the maximum size of the memory. Once capacity is reached, the old memories are removed.
def __init__(self, capacity):
self.memory = deque(maxlen=capacity)
'''
Adds a transition to the memory. The transition is a tuple of (state, action, reward, next_state, done).
If the maximum capacity of the memory is reached, the oldest memories are overwritten.
'''
def update(self, transition):
self.memory.append(transition)
'''
Retrieve a random sample from memory, batch size indicates the number of samples to be retrieved.
'''
def sample(self, batch_size):
return random.sample(self.memory, batch_size)
class Net(tf.keras.Model):
# constructor initializes the layers
def __init__(self, input_size, num_actions):
super(Net, self).__init__()
self.dense1 = Dense(128, activation='relu', input_shape=(input_size,))
self.dense2 = Dense(128, activation='relu')
self.output_layer = Dense(num_actions, activation='linear')
'''
This function is the forward pass of the model. It takes the state as input and returns the Q values for each action.
'x' is the input to the model, which is the state from the environment. 'x' is passed through the two dense
layers (ReLU activation) and the output layer to obtain the predicted Q-value for each action. The output layer
uses a linear activation so the network can predict an unbounded range of action values.
'''
def call(self, x):
# first dense layer
x = self.dense1(x)
# second dense layer
x = self.dense2(x)
return self.output_layer(x)
class Agent:
# this constructor initializes the environment, model, memory, and other variables required for the agent
def __init__(self, env_string, num_actions, state_size, batch_size=32, learning_rate=0.01, gamma=0.98, epsilon=1.0, epsilon_decay=0.98, epsilon_min=0.01, tau=0.01, memory_capacity=10000):
self.env_string = env_string
self.env = gym.make(env_string)
self.env.reset()
self.state_size = state_size
self.num_actions = num_actions
self.action_scope = [i * 4.0 / (num_actions - 1) - 2.0 for i in range(num_actions)] # Adjusted to match action scaling in dueling DQN
self.batch_size = batch_size
self.gamma = gamma
self.epsilon = epsilon
self.epsilon_decay = epsilon_decay
self.epsilon_min = epsilon_min
self.tau = tau
self.memory = Memory(memory_capacity)
self.best_total_reward = float('-inf')
self.model = Net(state_size, num_actions)
self.target_model = Net(state_size, num_actions)
self.optimizer = Adam(learning_rate)
self.loss_fn = tf.losses.MeanSquaredError()
'''
Implements the epsilon-greedy policy. With probability epsilon, a random action is selected, otherwise the action chosen will
be the one with the highest Q value. The epsilon value is decayed over time to reduce the exploration as the agent learns.
It also converts the state to a tensor before passing through to the model.
'''
def select_action(self, state):
# if random number is less than epsilon, return random action
if np.random.rand() < self.epsilon:
return np.random.randint(self.num_actions), None
# convert to tensor
state = tf.convert_to_tensor([state], dtype=tf.float32)
q_values = self.model(state)
# else, return action with highest Q value
return np.argmax(q_values.numpy()), None
# Store experience tuple into the memory buffer, function is above
def store_transition(self, state, action, reward, next_state):
self.memory.update((state, action, reward, next_state, False))
'''
Learn function performs single step training on a batch of experiences sampled from the memory buffer.
'''
def learn(self):
if len(self.memory.memory) < self.batch_size:
return
# sample a batch of transitions from the memory
transitions = self.memory.sample(self.batch_size)
# extract the states, actions, rewards, next states, from the batch
state_batch, action_batch, reward_batch, next_state_batch, _ = zip(*transitions)
# convert the s, a, r, s_' to tensors
state_batch = tf.convert_to_tensor(state_batch, dtype=tf.float32)
action_batch = tf.convert_to_tensor(action_batch, dtype=tf.int32)
reward_batch = tf.convert_to_tensor(reward_batch, dtype=tf.float32)
next_state_batch = tf.convert_to_tensor(next_state_batch, dtype=tf.float32)
'''
Tensorflow GradientTape is an API for automatic differentiation. It records operations for automatic differentiation.
This function will calculate the predicted Q-values from the current state and actions.
'''
with tf.GradientTape() as tape:
'''
Forward pass through the DQN model to get the Q-values for all actions given the current batch of states.
q-values contains the predicted q-values for each action in each state of the batch
'''
q_values = self.model(state_batch)
'''
'action_indices' calculates the indices of the actions taken in the Q-value matrix.
Since q_values contains Q-values for all actions, this step is necessary to select only the Q-values corresponding to the actions that were actually taken.
'''
action_indices = tf.range(self.batch_size, dtype=tf.int32) * self.num_actions + action_batch
'''
Reshapes the Q-value matrix to a single vector, then selects the Q-values for the actions taken using the 'action_indices' calculated above.
'''
predicted_q = tf.gather(tf.reshape(q_values, [-1]), action_indices)
'''
Forward pass through the target DQN model to get Q-values for all actions given the next states.
'''
next_q_values = self.target_model(next_state_batch)
# Use the main network to select actions for the next state
next_action_indices = tf.argmax(self.model(next_state_batch), axis=1, output_type=tf.int32)
# Gather the Q-values for the actions taken by the next state from the target network
next_q_values_target = tf.gather(tf.reshape(next_q_values, [-1]),
tf.range(self.batch_size, dtype=tf.int32) * self.num_actions + next_action_indices)
'''
Calculates the target Q-values from the immediate reward (reward_batch) plus the discounted Q-value of the
selected next action (self.gamma * next_q_values_target). This forms the update target for the Q-value of the action taken.
'''
target_q = reward_batch + self.gamma * next_q_values_target
# Calculates the loss between the predicted Q-values and the target Q-values, basically MSE loss
loss = self.loss_fn(target_q, predicted_q)
# Calculate the gradients of the loss with respect to the model parameters
grads = tape.gradient(loss, self.model.trainable_variables)
clipped_grads = [tf.clip_by_value(grad, -1.0, 1.0) for grad in grads]
self.optimizer.apply_gradients(zip(clipped_grads, self.model.trainable_variables))
self.update_epsilon()
# update the epsilon value
def update_epsilon(self):
self.epsilon = max(self.epsilon_min, self.epsilon_decay * self.epsilon)
# soft-update the target network's weights toward the main network's weights
def update_target_model(self):
main_weights = self.model.get_weights()
target_weights = self.target_model.get_weights()
for i in range(len(target_weights)):
target_weights[i] = self.tau * main_weights[i] + (1 - self.tau) * target_weights[i]
self.target_model.set_weights(target_weights)
# checks if current model is the best model based on total reward
def is_best_model(self, total_reward):
return total_reward > self.best_total_reward
# updates the best model with the current model
def update_best_model(self, total_reward):
self.best_total_reward = total_reward
# Main function to run the training
def main():
env_string = 'Pendulum-v0'
num_actions = 5
state_size = gym.make(env_string).observation_space.shape[0]
agent = Agent(env_string, num_actions, state_size)
episodes = 25
rewards = [] # Initialize rewards list to store total rewards per episode
for ep in range(episodes):
state = agent.env.reset()
done = False
total_reward = 0
# Capture frames for GIF
frames = []
while not done:
# Render the environment and capture frames
frames.append(agent.env.render(mode='rgb_array'))
action_index, _ = agent.select_action(state)
action = [agent.action_scope[action_index]]
next_state, reward, done, _ = agent.env.step(action)
agent.store_transition(state, action_index, reward, next_state)
state = next_state
total_reward += reward
agent.learn()
agent.update_target_model()
# Adjust epsilon after each episode
agent.update_epsilon()
rewards.append(total_reward) # Append the total reward of the episode to the rewards list
print(f"Episode: {ep+1}, Total Reward: {total_reward:.2f}, Epsilon: {agent.epsilon:.2f}")
if agent.is_best_model(total_reward):
# Update and save the best model weights
agent.update_best_model(total_reward)
model_dir = 'Weights/DoubleDQN/gradientClip'
save_model_weights(agent, model_dir)
print("Best model weights saved.")
# Directory where you want to save the files
save_dir = 'training_animations/DoubleDQN/GradientClipping'
# Create the directory if it doesn't exist
if not os.path.exists(save_dir):
os.makedirs(save_dir)
# Save frames as GIF
gif_filename = os.path.join(save_dir, f'episode_{ep+1}.gif')
imageio.mimsave(gif_filename, frames, duration=0.00333, loop=0) # Adjust duration as needed
agent.env.close()
return rewards
rewards = main()
Episode: 1, Total Reward: -1705.84, Epsilon: 0.03
Best model weights saved.
Episode: 2, Total Reward: -1744.61, Epsilon: 0.01
Episode: 3, Total Reward: -1359.71, Epsilon: 0.01
Best model weights saved.
Episode: 4, Total Reward: -1522.26, Epsilon: 0.01
Episode: 5, Total Reward: -935.61, Epsilon: 0.01
Best model weights saved.
Episode: 6, Total Reward: -281.78, Epsilon: 0.01
Best model weights saved.
Episode: 7, Total Reward: -1067.62, Epsilon: 0.01
Episode: 8, Total Reward: -290.45, Epsilon: 0.01
Episode: 9, Total Reward: -656.42, Epsilon: 0.01
Episode: 10, Total Reward: -1234.10, Epsilon: 0.01
Episode: 11, Total Reward: -372.96, Epsilon: 0.01
Episode: 12, Total Reward: -233.19, Epsilon: 0.01
Best model weights saved.
Episode: 13, Total Reward: -256.82, Epsilon: 0.01
Episode: 14, Total Reward: -255.40, Epsilon: 0.01
Episode: 15, Total Reward: -130.00, Epsilon: 0.01
Best model weights saved.
Episode: 16, Total Reward: -364.20, Epsilon: 0.01
Episode: 17, Total Reward: -141.31, Epsilon: 0.01
Episode: 18, Total Reward: -261.31, Epsilon: 0.01
Episode: 19, Total Reward: -257.46, Epsilon: 0.01
Episode: 20, Total Reward: -128.80, Epsilon: 0.01
Best model weights saved.
Episode: 21, Total Reward: -380.72, Epsilon: 0.01
Episode: 22, Total Reward: -248.83, Epsilon: 0.01
Episode: 23, Total Reward: -257.57, Epsilon: 0.01
Episode: 24, Total Reward: -243.59, Epsilon: 0.01
Episode: 25, Total Reward: -124.95, Epsilon: 0.01
Best model weights saved.
plot_rewards(rewards)
compute_moving_average_and_plot(rewards)
Results
After incorporating gradient clipping, the new model's performance aligns closely with the base Double DQN model. It effectively keeps the pendulum upright but with slightly reduced stability compared to the base model. This suggests that while gradient clipping has improved training stability by addressing volatility issues, it has not fully optimized the model's performance.
The Rewards plot and moving average plot show enhanced stability after a few episodes, with smoother curves indicating reduced volatility in the learning process. This improvement demonstrates that gradient clipping has led to a more stable training phase compared to the earlier high volatility.
Despite this progress, the best reward achieved is around -124.95, well short of the base model's best of -2.50. In the context of the pendulum balancing task, this suggests the model still struggles to keep the pendulum upright for extended periods, indicating that gradient clipping alone does not substantially improve balancing performance.
To address this, we will explore Prioritized Experience Replay (PER). PER enhances traditional experience replay by prioritizing transitions with high temporal-difference (TD) error, allowing the model to focus on more informative experiences. This approach aims to improve learning efficiency and effectiveness by concentrating on the most relevant transitions to reduce overall prediction error.
Prioritized Experience Replay ¶
What is Prioritized Memory?¶
Prioritized Memory (or Prioritized Experience Replay) is a technique used in reinforcement learning to improve the efficiency and effectiveness of experience replay. In traditional experience replay, transitions (state, action, reward, next state) are stored in a memory buffer and sampled uniformly during training. In contrast, Prioritized Memory assigns different priorities to transitions based on their importance, and samples transitions with higher priority more frequently.
Key Concepts¶
Priority: Each transition is assigned a priority, which reflects its importance. Transitions that have higher priority are sampled more often.
Importance Sampling: To correct for the bias introduced by the non-uniform sampling, importance sampling weights are applied during the learning process.
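A minimal sketch of how these two concepts combine, using hypothetical TD errors and alpha/beta exponents in the range commonly suggested for PER. The PrioritizedMemory class below implements the priority-based sampling; a complete implementation would also multiply each sample's loss by the importance-sampling weight computed here.

```python
import numpy as np

alpha, beta = 0.6, 0.4                         # assumed exponent values
td_errors = np.array([0.1, 2.0, 0.5, 0.05])    # hypothetical TD errors

# Priority: larger TD error -> larger priority (epsilon avoids zeros)
priorities = np.abs(td_errors) + 1e-5

# Sampling probability P(i) proportional to priority^alpha
probs = priorities ** alpha
probs /= probs.sum()

# Importance-sampling weight corrects the non-uniform sampling bias:
#   w_i = (N * P(i))^(-beta), normalized by the max for stability
N = len(td_errors)
weights = (N * probs) ** (-beta)
weights /= weights.max()

print(probs)     # the transition with TD error 2.0 is sampled most often
print(weights)   # ...but its gradient update is down-weighted the most
```

The two mechanisms pull in opposite directions by design: high-error transitions are replayed more often, while their weights shrink the size of each individual update so the expected gradient stays unbiased.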
How is it different from normal memory?¶
Sampling Mechanism:
- Normal Memory: Transitions are sampled uniformly at random. Every transition has an equal probability of being sampled.
- Prioritized Memory: Transitions are sampled according to their priority. Transitions with higher priorities are sampled more frequently.
Transition Storage:
- Normal Memory: Transitions are stored without any associated priority. All transitions are treated equally.
- Prioritized Memory: Transitions are stored with associated priorities, which influence how often they are sampled.
Priority Update:
- Normal Memory: No priority update is needed as transitions are treated equally.
- Prioritized Memory: Priorities are updated as new experiences are gathered and the network is trained. This requires maintaining and updating a priority queue or similar structure.
How Does It Help the Double DQN Model?¶
Focus on Important Transitions: Prioritized Memory ensures that more informative transitions (those with high TD error) are replayed more frequently. This helps the model learn more effectively from valuable experiences, which can accelerate convergence.
Enhanced Exploration: By focusing on transitions with high TD error, the agent can explore more diverse and critical states, improving the learning of rare but important experiences. In addition, by prioritizing important transitions, the agent makes better use of the data it gathers. This means that the learning process becomes more sample-efficient, which contributes to training stability and faster convergence.
Reduced Overestimation Bias: Prioritized Memory, when used in conjunction with Double DQN, helps mitigate the overestimation bias present in Q-learning. Double DQN decouples action selection and Q-value evaluation, and Prioritized Memory ensures that important transitions (with higher TD errors) are more frequently used to correct Q-value estimates.
Summary¶
Prioritized Memory is a technique that enhances the traditional experience replay by sampling transitions based on their importance rather than uniformly. This approach helps in more efficient and stable learning by focusing on valuable experiences, improving exploration, and reducing overestimation bias. When used with Double DQN, Prioritized Memory contributes to better training stability and faster convergence by ensuring that critical transitions are replayed more frequently and bias in Q-value updates is corrected.
class PrioritizedMemory:
def __init__(self, capacity, alpha=0.6):
self.capacity = capacity
self.alpha = alpha
self.memory = deque(maxlen=capacity)
self.priorities = deque(maxlen=capacity)
self.epsilon = 1e-5 # Small positive value to avoid zero priorities
def update(self, transition, priority=1.0):
# deque(maxlen=capacity) evicts the oldest entry automatically on append,
# so a plain append keeps memory and priorities in sync
self.memory.append(transition)
self.priorities.append(priority)
def sample(self, batch_size):
probabilities = np.array(self.priorities) ** self.alpha
probabilities /= probabilities.sum()
indices = np.random.choice(len(self.memory), batch_size, p=probabilities)
return [self.memory[i] for i in indices], indices
def update_priorities(self, indices, priorities):
for idx, priority in zip(indices, priorities):
self.priorities[idx] = priority + self.epsilon
class Net(tf.keras.Model):
# constructor initializes the layers
def __init__(self, input_size, num_actions):
super(Net, self).__init__()
self.dense1 = Dense(128, activation='relu', input_shape=(input_size,))
self.dense2 = Dense(128, activation='relu')
self.output_layer = Dense(num_actions, activation='linear')
'''
This function is the forward pass of the model. It takes the state as input and returns the Q-values for each action.
'x' is the input to the model, which is the state from the environment. 'x' is passed through the 2 dense layers
(ReLU activation) and then through the output layer. The output layer uses a linear activation so the network can
predict an unbounded range of values, as required when estimating action values in the RL model.
'''
def call(self, x):
# first dense layer
x = self.dense1(x)
# second dense layer
x = self.dense2(x)
return self.output_layer(x)
class Agent:
# this constructor initializes the environment, model, memory, and other variables required for the agent
def __init__(self, env_string, num_actions, state_size, batch_size=32, learning_rate=0.01, gamma=0.98, epsilon=1.0, epsilon_decay=0.98, epsilon_min=0.01, tau=0.01, memory_capacity=10000):
self.env_string = env_string
self.env = gym.make(env_string)
self.env.reset()
self.state_size = state_size
self.num_actions = num_actions
self.action_scope = [i * 4.0 / (num_actions - 1) - 2.0 for i in range(num_actions)] # Adjusted to match action scaling in dueling DQN
self.batch_size = batch_size
self.gamma = gamma
self.epsilon = epsilon
self.epsilon_decay = epsilon_decay
self.epsilon_min = epsilon_min
self.tau = tau
self.memory = PrioritizedMemory(memory_capacity)
self.best_total_reward = float('-inf')
self.model = Net(state_size, num_actions)
self.target_model = Net(state_size, num_actions)
self.optimizer = Adam(learning_rate)
self.loss_fn = tf.losses.MeanSquaredError()
'''
Implements the epsilon-greedy policy. With probability epsilon, a random action is selected, otherwise the action chosen will
be the one with the highest Q value. The epsilon value is decayed over time to reduce the exploration as the agent learns.
It also converts the state to a tensor before passing through to the model.
'''
def select_action(self, state):
# if random number is less than epsilon, return random action
if np.random.rand() < self.epsilon:
return np.random.randint(self.num_actions), None
# convert to tensor
state = tf.convert_to_tensor([state], dtype=tf.float32)
q_values = self.model(state)
# else, return action with highest Q value
return np.argmax(q_values.numpy()), None
# Store an experience tuple in the memory buffer (via PrioritizedMemory.update defined above)
def store_transition(self, state, action, reward, next_state, priority=1.0):
self.memory.update((state, action, reward, next_state), priority)
'''
The learn function performs a single training step on a batch of experiences sampled from the memory buffer.
'''
def learn(self):
if len(self.memory.memory) < self.batch_size:
return
# Sample a batch of transitions and their indices
transitions, indices = self.memory.sample(self.batch_size)
# Extract the states, actions, rewards, and next states from the batch
state_batch, action_batch, reward_batch, next_state_batch = zip(*transitions)
state_batch = tf.convert_to_tensor(state_batch, dtype=tf.float32)
action_batch = tf.convert_to_tensor(action_batch, dtype=tf.int32)
reward_batch = tf.convert_to_tensor(reward_batch, dtype=tf.float32)
next_state_batch = tf.convert_to_tensor(next_state_batch, dtype=tf.float32)
with tf.GradientTape() as tape:
q_values = self.model(state_batch)
action_indices = tf.range(self.batch_size, dtype=tf.int32) * self.num_actions + action_batch
predicted_q = tf.gather(tf.reshape(q_values, [-1]), action_indices)
next_q_values = self.target_model(next_state_batch)
next_action_indices = tf.argmax(self.model(next_state_batch), axis=1, output_type=tf.int32)
next_q_values_target = tf.gather(tf.reshape(next_q_values, [-1]),
tf.range(self.batch_size, dtype=tf.int32) * self.num_actions + next_action_indices)
target_q = reward_batch + self.gamma * next_q_values_target
loss = self.loss_fn(target_q, predicted_q)
grads = tape.gradient(loss, self.model.trainable_variables)
# clipped_grads = [tf.clip_by_value(grad, -1.0, 1.0) for grad in grads]
self.optimizer.apply_gradients(zip(grads, self.model.trainable_variables))
# Compute TD errors and update priorities
td_errors = tf.abs(target_q - predicted_q).numpy()
self.memory.update_priorities(indices, td_errors + 1e-5) # Add small epsilon to avoid zero priorities
self.update_epsilon()
# update the epsilon value
def update_epsilon(self):
self.epsilon = max(self.epsilon_min, self.epsilon_decay * self.epsilon)
# update the model
def update_target_model(self):
main_weights = self.model.get_weights()
target_weights = self.target_model.get_weights()
for i in range(len(target_weights)):
target_weights[i] = self.tau * main_weights[i] + (1 - self.tau) * target_weights[i]
self.target_model.set_weights(target_weights)
# checks if current model is the best model based on total reward
def is_best_model(self, total_reward):
return total_reward > self.best_total_reward
# updates the best model with the current model
def update_best_model(self, total_reward):
self.best_total_reward = total_reward
# Main function to run the training
def main():
env_string = 'Pendulum-v0'
num_actions = 5
state_size = gym.make(env_string).observation_space.shape[0]
agent = Agent(env_string, num_actions, state_size)
episodes = 25
rewards = [] # Initialize rewards list to store total rewards per episode
for ep in range(episodes):
state = agent.env.reset()
done = False
total_reward = 0
# Capture frames for GIF
frames = []
while not done:
# Render the environment and capture frames
frames.append(agent.env.render(mode='rgb_array'))
action_index, _ = agent.select_action(state)
action = [agent.action_scope[action_index]]
next_state, reward, done, _ = agent.env.step(action)
agent.store_transition(state, action_index, reward, next_state)
state = next_state
total_reward += reward
agent.learn()
agent.update_target_model()
# Adjust epsilon after each episode
agent.update_epsilon()
rewards.append(total_reward) # Append the total reward of the episode to the rewards list
print(f"Episode: {ep+1}, Total Reward: {total_reward:.2f}, Epsilon: {agent.epsilon:.2f}")
if agent.is_best_model(total_reward):
# Update and save the best model weights
agent.update_best_model(total_reward)
model_dir = 'Weights/DoubleDQN/PrioritizedReplay'
save_model_weights(agent, model_dir)
print("Best model weights saved.")
# Directory where you want to save the files
save_dir = 'training_animations/DoubleDQN/PrioritizedReplay'
# Create the directory if it doesn't exist
if not os.path.exists(save_dir):
os.makedirs(save_dir)
# Save frames as GIF
gif_filename = os.path.join(save_dir, f'episode_{ep+1}.gif')
imageio.mimsave(gif_filename, frames, duration=0.00333, loop=0) # Adjust duration as needed
agent.env.close()
return rewards
rewards = main()
Episode: 1, Total Reward: -1424.95, Epsilon: 0.03
Best model weights saved.
Episode: 2, Total Reward: -1648.41, Epsilon: 0.01
Episode: 3, Total Reward: -1315.96, Epsilon: 0.01
Best model weights saved.
Episode: 4, Total Reward: -1054.94, Epsilon: 0.01
Best model weights saved.
Episode: 5, Total Reward: -1052.24, Epsilon: 0.01
Best model weights saved.
Episode: 6, Total Reward: -1069.60, Epsilon: 0.01
Episode: 7, Total Reward: -1081.39, Epsilon: 0.01
Episode: 8, Total Reward: -503.01, Epsilon: 0.01
Best model weights saved.
Episode: 9, Total Reward: -131.85, Epsilon: 0.01
Best model weights saved.
Episode: 10, Total Reward: -256.92, Epsilon: 0.01
Episode: 11, Total Reward: -357.69, Epsilon: 0.01
Episode: 12, Total Reward: -124.95, Epsilon: 0.01
Best model weights saved.
Episode: 13, Total Reward: -119.42, Epsilon: 0.01
Best model weights saved.
Episode: 14, Total Reward: -233.52, Epsilon: 0.01
Episode: 15, Total Reward: -123.64, Epsilon: 0.01
Episode: 16, Total Reward: -120.59, Epsilon: 0.01
Episode: 17, Total Reward: -123.49, Epsilon: 0.01
Episode: 18, Total Reward: -115.72, Epsilon: 0.01
Best model weights saved.
Episode: 19, Total Reward: -0.87, Epsilon: 0.01
Best model weights saved.
Episode: 20, Total Reward: -125.09, Epsilon: 0.01
Episode: 21, Total Reward: -256.43, Epsilon: 0.01
Episode: 22, Total Reward: -119.42, Epsilon: 0.01
Episode: 23, Total Reward: -4.00, Epsilon: 0.01
Episode: 24, Total Reward: -127.75, Epsilon: 0.01
Episode: 25, Total Reward: -232.11, Epsilon: 0.01
plot_rewards(rewards)
compute_moving_average_and_plot(rewards)
Results¶
After incorporating Prioritized Experience Replay (PER), the new model shows significant improvements, merging the strengths of the base and gradient clipping Double DQN models. It successfully maintains the pendulum in an upright and steady position, reflecting the enhanced learning efficiency offered by PER.
The training process also benefits from greater stability. The Rewards plot and moving average plot exhibit smoother curves with reduced volatility, signifying a more consistent learning trajectory. This stability indicates that PER has helped the model focus on more informative transitions, leading to fewer drastic performance fluctuations and a more reliable approach to policy optimization.
Moreover, the rewards have notably improved, with the highest reward reaching around -0.87. This indicates excellent performance on the pendulum balancing task and shows that the model is nearing its goal of keeping the pendulum upright for extended periods. This success underscores the effectiveness of Prioritized Experience Replay in enhancing the agent's policy and overall performance.
Running the Double DQN weights multiple times (Test the Double DQN Model) ¶
if __name__ == '__main__':
env_string = 'Pendulum-v0'
input_size = gym.make(env_string).observation_space.shape[0]
num_actions = 5 # Must match the number used during training
model_dir = 'Weights/DoubleDQN/PrioritizedReplay'
weights_path = os.path.join(model_dir, 'tensorflow_dqn_weights.h5')
# Ensure the model path exists
if os.path.exists(weights_path):
model = load_model(weights_path, input_size, num_actions)
rewards = test_model(env_string, model, 'DoubleDQN/Best', 20)
else:
print(f"Model weights not found in {weights_path}. Please ensure the correct path.")
Episode: 1, Total Reward: -127.59
Episode: 2, Total Reward: -400.45
Episode: 3, Total Reward: -120.78
Episode: 4, Total Reward: -274.49
Episode: 5, Total Reward: -240.44
Episode: 6, Total Reward: -118.02
Episode: 7, Total Reward: -4.11
Episode: 8, Total Reward: -229.02
Episode: 9, Total Reward: -119.32
Episode: 10, Total Reward: -127.01
Episode: 11, Total Reward: -124.72
Episode: 12, Total Reward: -251.66
Episode: 13, Total Reward: -252.73
Episode: 14, Total Reward: -124.81
Episode: 15, Total Reward: -257.87
Episode: 16, Total Reward: -405.24
Episode: 17, Total Reward: -125.50
Episode: 18, Total Reward: -125.33
Episode: 19, Total Reward: -126.71
Episode: 20, Total Reward: -377.37
plot_rewards(rewards)
display_gifs_in_grid('test_animations/DoubleDQN/Best')
Results¶
The Double DQN reliably balances the pendulum in all 20 trials (100%), with total rewards ranging from about -405 to -4. This shows that the model is both highly effective and robust in maintaining control. While the varied rewards reflect differences in performance due to factors like starting conditions, the model's consistent success in balancing the pendulum highlights its reliability and effectiveness in achieving its objective.
Dueling DQN ¶
What is Dueling DQN¶
Dueling Deep Q-Networks (Dueling DQN) is an enhancement to the standard Deep Q-Learning (DQN) architecture, designed to improve the learning efficiency and performance of Q-Learning algorithms.
In standard DQN, a neural network is used to approximate the Q-value function, which estimates the expected return for a given state-action pair. The network learns to predict Q-values for all possible actions in a given state.
Dueling DQN modifies this architecture by separately estimating two key components: the state value function and the advantage function.
State Value Function¶
The state value function quantifies the long-term benefit of being in a particular state within an environment. It represents the expected return or cumulative reward an agent can anticipate if it starts from a given state and follows a particular policy thereafter.
Advantage Function¶
The advantage function evaluates the relative value of taking a specific action in a given state compared to the average action in that state. It provides a measure of how much better or worse an action is compared to the baseline of the average action value.
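In symbols, for a policy π:

```latex
V^{\pi}(s) = \mathbb{E}_{a \sim \pi}\left[\, Q^{\pi}(s,a) \,\right],
\qquad
A^{\pi}(s,a) = Q^{\pi}(s,a) - V^{\pi}(s)
```

so the advantage of the average action is zero, and a positive $A^{\pi}(s,a)$ marks an action that is better than the policy's average choice in that state.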
Dueling DQN vs DQN¶
Unlike standard DQN, which directly estimates Q-values, Dueling DQN separates the Q-value estimation into two streams: a value stream and an advantage stream. The value stream estimates the overall value of being in a particular state, while the advantage stream calculates the relative advantage of each action in that state.
This separation allows Dueling DQN to learn more efficiently, especially in situations where the choice of action doesn't significantly impact the outcome. By explicitly estimating the state value function, Dueling DQN can identify valuable states without having to learn the effect of each possible action in those states. This leads to improved generalization across actions and often results in better performance compared to standard DQN.
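The aggregation of the two streams can be sketched numerically (the V and A values here are illustrative; the network below produces them with linear layers):

```python
import numpy as np

# Illustrative outputs of the two streams for one state with 5 actions
v = 1.5                                    # state value V(s), a scalar
a = np.array([0.2, -0.1, 0.4, 0.0, -0.5])  # advantages A(s, a)

# Q(s, a) = V(s) + (A(s, a) - mean_a A(s, a))
# Subtracting the mean makes the decomposition identifiable: the
# advantage stream carries only relative preferences between actions
q = v + (a - a.mean())

# The greedy action is unchanged by the mean subtraction
assert q.argmax() == a.argmax()
```

Because the mean-subtracted advantages sum to zero, the average of the resulting Q-values equals the state value V(s).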
Implementing Dueling DQN using Pytorch¶
For our Dueling DQN, we will be implementing it using Pytorch instead of TensorFlow¶
Using PyTorch allows our Dueling DQN to execute faster
- PyTorch's dynamic computation graph allows for more efficient memory usage, thus allowing the model to train faster
It is easier for us to implement the Dueling DQN using PyTorch
- PyTorch offers more flexibility and control over the model architecture and training loop. This can be particularly useful when implementing more complex architectures like Dueling DQN
Base Model ¶
Overview of the Agent Class¶
The Agent class in the code utilizes the Double DQN method to tackle the issue of Q-value overestimation by separating the processes of action selection and evaluation. It incorporates two distinct neural networks: eval_net for action selection and target_net for evaluating these actions. This decoupling helps in reducing bias and improving the accuracy of Q-value estimates.
Action Selection and Training Stability¶
The model employs an epsilon-greedy strategy for action selection, allowing the agent to balance exploration and exploitation as training progresses. Additionally, gradient clipping is used to stabilize the training process by preventing excessively large gradients, which can lead to instability. The experience replay buffer plays a crucial role in this setup, storing and sampling transitions to help the model learn from a diverse range of experiences.
Training Process and Model Enhancement¶
During training, the target network is updated periodically to synchronize with the evaluation network, a common practice in DQN training to maintain stability. The exploration rate is also adjusted to shift from exploration to exploitation over time. This implementation of Double DQN, combined with experience replay and dueling network architecture, forms a robust baseline model, providing a solid foundation for further improvements and comparisons with more advanced techniques.
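The periodic (hard) target-network synchronisation described above can be sketched in isolation before the full listing (a minimal sketch; the layer sizes are illustrative):

```python
import torch
import torch.nn as nn

# Two networks with identical architecture: eval_net is trained every
# step, target_net stays frozen between synchronisation points
eval_net = nn.Linear(3, 5)
target_net = nn.Linear(3, 5)

SYNC_EVERY = 200  # same interval as the update() method below

for training_step in range(1, 401):
    # ... a gradient step on eval_net would happen here ...
    if training_step % SYNC_EVERY == 0:
        # Hard update: copy all weights at once, in contrast to the
        # soft (tau-blended) updates used in the TensorFlow DQN earlier
        target_net.load_state_dict(eval_net.state_dict())
```

After each sync, both networks produce identical Q-value estimates until eval_net is trained further.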
import os
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"
# Hyperparameters for DQN algorithm
class Args:
gamma = 0.9 # discount factor
num_actions = 5
seed = 0
render = False
log_interval = 10
args = Args()
torch.manual_seed(args.seed)
np.random.seed(args.seed)
# Store the training records (episode number and reward)
TrainingRecord = namedtuple('TrainingRecord', ['ep', 'reward'])
# Store the transitions -> state (s), action (a), reward (r), next state (s_)
Transition = namedtuple('Transition', ['s', 'a', 'r', 's_'])
# Defines the neural network used by the DQN agent: a dueling architecture with a tanh hidden layer and linear output heads
class Net(nn.Module):
def __init__(self):
# initialize the base class for pytorch
super(Net, self).__init__()
'''
dense fully connected layer with input size of 3 and output size of 100. Initial feature extractor from the
input state vector, in this case (Pendulum-v0) represents the continuous state space.
'''
self.fc = nn.Linear(3, 100)
'''
This layer maps the 100 dimension feature representation to a space representing the action values (Q-values) for
each possible action
'''
self.a_head = nn.Linear(100, args.num_actions)
'''
Another layer that also maps the 100 dimension feature representation to a single value (V). This is part of the dueling DQN
architecture concept, where the network separately estimates state values and advantage values.
'''
self.v_head = nn.Linear(100, 1)
# Forward pass combines state and action values to produce Q-values for Dueling DQN.
def forward(self, x):
# Using the tanh activation function
x = torch.tanh(self.fc(x))
a = self.a_head(x) - self.a_head(x).mean(1, keepdim=True)
v = self.v_head(x)
# Computes state value (V) with advantage value (a) for each action to produce final Q-values.
action_scores = a + v
return action_scores
'''
This class stores experience transitions encountered by the agent during training. This mechanism
is a core component of the experience replay technique used in DQN networks to stabilize and improve the learning process.
'''
class Memory():
data_pointer = 0 # keeps track of current index in the memory buffer where the next transition will be stored.
isfull = False # flag indicating whether memory buffer has been filled up at least once. Important to know whether memory buffer contains enough samples to start sampling or not
# initialize the memory buffer
def __init__(self, capacity):
self.memory = np.empty(capacity, dtype=object)
self.capacity = capacity
# method to add a new transition to the memory buffer
def update(self, transition):
# new transition is stored in the memory at the data pointer index, which was initialized above, starting from 0
self.memory[self.data_pointer] = transition
# increment the data pointer to next location in memory for the next transition
self.data_pointer += 1
# if the buffer is filled, the flag will indicate TRUE and the future updates will start overwriting the old transitions, as pointer will reset to 0
if self.data_pointer == self.capacity:
self.data_pointer = 0
self.isfull = True
# this method randomly samples a batch of transitions from the memory, and batch_size determines the size of the batch.
def sample(self, batch_size):
return np.random.choice(self.memory, batch_size)
'''
This Agent class performs the behaviour and learning algorithm of the DQN model (agent). This class has the methods
for action selection, memory storage, parameter saving, and the learning algorithm of the dueling DQN network. It initializes
the 2 neural networks (eval_net and target_net) for calculating the Q-values, and an optimizer for training the eval_net.
'''
class Agent():
'''
This action_list generates a list of actions the agent can take. Since a DQN model is meant for discrete action spaces, but the Pendulum-v0 is
a continuous action space, we will need to convert the action space to discrete. We can do that by pre-defining the actions the model can take, such
as an increment of values from -2 to 2. This allows the model to choose a specific number.
'''
action_list = [(i * 4 / (args.num_actions - 1) - 2,) for i in range(args.num_actions)]  # evenly spaced torques from -2 to 2
max_grad_norm = 0.5
def __init__(self):
self.training_step = 0
# initial exploration rate is 1, which decays over time to encourage exploration at the start and exploitation of the learned policy later.
self.epsilon = 1
# the models
self.eval_net, self.target_net = Net().float(), Net().float()
# memory
self.memory = Memory(2000)
# optimizer
self.optimizer = optim.Adam(self.eval_net.parameters(), lr=1e-3)
'''
The select_action method uses the epsilon-greedy policy for action selection. With probability epsilon, it
selects a random action (exploration); with probability 1 - epsilon, it selects the action with the highest Q-value as predicted by the eval_net (exploitation).
'''
def select_action(self, state):
# Convert state from numpy array to pytorch tensor and process it to get action values
state = torch.from_numpy(state).float().unsqueeze(0)
if np.random.random() < self.epsilon:
action_index = np.random.randint(args.num_actions)
else:
# Exploitation: forward pass through eval_net and choose the action with the highest predicted Q-value
probs = self.eval_net(state)
action_index = probs.max(1)[1].item()
return self.action_list[action_index], action_index
# save parameters
def save_param(self):
if not os.path.exists('Weights/DuelingDQN/Base'):
os.makedirs('Weights/DuelingDQN/Base')
torch.save(self.eval_net.state_dict(), 'Weights/DuelingDQN/Base/dqn_net_params.pkl')
# store transition in the memory buffer. Contains the s, a, r ,s_ for the model to learn
def store_transition(self, transition):
self.memory.update(transition)
'''
This update method performs a single update step on the eval_net model, using a random batch of transitions from the memory. This
method will
1. Sample a batch and convert them to pytorch tensors
2. Use a double dqn method to decouple selection and evaluation of actions,
which will help reduce overestimations of Q-values regularly seen in natural DQNs
3. Calculate the loss between predicted and target Q-values, using the smooth L1 (Huber) loss
as the objective function
4. Perform backpropagation and optimization to update weights of eval_net
5. Update target_net every 200 steps
6. Decay exploration rate (epsilon) to slowly shift from exploration to exploitation
'''
def update(self):
self.training_step += 1
# take a batch from memory
transitions = self.memory.sample(32)
s = torch.tensor([t.s for t in transitions], dtype=torch.float)
a = torch.tensor([t.a for t in transitions], dtype=torch.long).view(-1, 1)
r = torch.tensor([t.r for t in transitions], dtype=torch.float).view(-1, 1)
s_ = torch.tensor([t.s_ for t in transitions], dtype=torch.float)
# natural dqn
# q_eval = self.eval_net(s).gather(1, a)
# with torch.no_grad():
# q_target = r + args.gamma * self.target_net(s_).max(1, keepdim=True)[0]
# double dqn method, to reduce overestimations of Q-values
with torch.no_grad():
a_ = self.eval_net(s_).max(1, keepdim=True)[1]
q_target = r + args.gamma * self.target_net(s_).gather(1, a_)
q_eval = self.eval_net(s).gather(1, a)
# optimizer
self.optimizer.zero_grad()
# smooth L1 (Huber) loss between predicted and target Q-values
loss = F.smooth_l1_loss(q_eval, q_target)
loss.backward()
nn.utils.clip_grad_norm_(self.eval_net.parameters(), self.max_grad_norm)
self.optimizer.step()
# update the target_net every 200 steps to synchronise model with eval_net
if self.training_step % 200 == 0:
self.target_net.load_state_dict(self.eval_net.state_dict())
# epsilon decay rate to move from exploration to exploitation
self.epsilon = max(self.epsilon * 0.999, 0.01)
return q_eval.mean().item()
# Main function to call all our above functions for the model to run
def main():
env = gym.make('Pendulum-v0')
env.seed(args.seed)
agent = Agent()
episodes, rewards = [], [] # Initialize empty lists to store episodes and rewards
running_reward, running_q = -1000, 0
for i_ep in range(100):
score = 0
state = env.reset()
# Store the Frames
frames = []
for t in range(200):
# Render the environment and capture frames
frames.append(env.render(mode='rgb_array'))
action, action_index = agent.select_action(state)
state_, reward, done, _ = env.step(action)
score += reward
if args.render:
env.render()
agent.store_transition(Transition(state, action_index, (reward + 8) / 8, state_))
state = state_
if agent.memory.isfull:
q = agent.update()
running_q = 0.99 * running_q + 0.01 * q
running_reward = running_reward * 0.9 + score * 0.1
episodes.append(i_ep)
rewards.append(running_reward)
if i_ep % args.log_interval == 0:
print('Ep {}\tAverage score: {:.2f}\tAverage Q: {:.2f}'.format(i_ep, running_reward, running_q))
if running_reward > -200:
print("Solved! Running reward is now {}!".format(running_reward))
agent.save_param()
break
# Directory where you want to save the files
save_dir = 'training_animations/DuelingDQN/Base'
# Create the directory if it doesn't exist
if not os.path.exists(save_dir):
os.makedirs(save_dir)
# Save frames as GIF
gif_filename = os.path.join(save_dir, f'episode_{i_ep+1}.gif')
imageio.mimsave(gif_filename, frames, duration=0.00333, loop=0) # Adjust duration as needed
env.close()
return episodes, rewards
episodes, rewards = main()
Ep 0	Average score: -1066.74	Average Q: 0.00
Ep 10	Average score: -1273.12	Average Q: 0.37
Ep 20	Average score: -1134.14	Average Q: 2.64
Ep 30	Average score: -680.52	Average Q: 6.33
Ep 40	Average score: -451.42	Average Q: 7.89
Ep 50	Average score: -309.87	Average Q: 8.72
Ep 60	Average score: -349.83	Average Q: 8.11
Ep 70	Average score: -230.87	Average Q: 9.18
Solved! Running reward is now -192.52134222194124!
plot_rewards(rewards)
compute_moving_average_and_plot(rewards)
GIF of highest reward attempt¶
Results¶
The Dueling DQN model shows significant learning progress, with rewards steadily increasing and eventually stabilizing. However, the minor spikes observed in the reward and moving average plots towards the end of training indicate occasional variability.
The model's ability to maintain the pendulum in an upright position, as observed from the rewards and the GIF, suggests that it has effectively learned the optimal policy.
Weight Decay ¶
What changed from the base model?¶
- Weight Decay Application: Weight decay is a form of regularization applied during the optimization process. It adds a penalty to the loss function proportional to the sum of the squared weights (L2 regularization). In the code, this is done through the weight_decay parameter in the Adam optimizer.
- Parameter Setting: weight_decay=1e-5 specifies the strength of the regularization. This small value indicates that the regularization effect is subtle but present.
How do these changes help?¶
- Prevents Overfitting: Weight decay discourages the network from learning excessively large weights, which helps improve generalization to unseen data. By penalizing large weights, it forces the model to learn more general features and avoid overfitting.
- Stabilizes Training: Weight decay introduces a regularization term that moderates weight changes, which can stabilize the training process. This makes the optimization less sensitive to large gradients and helps maintain a more controlled training trajectory.
- Overall Impact: By including weight decay, the training process is less likely to produce overly large weights and more likely to converge to a solution that generalizes well. This subtle regularization helps balance the trade-off between fitting the training data and maintaining generalization.
In summary, adding weight decay helps to manage the complexity of the neural network by penalizing large weights, which can enhance generalization, stabilize training, and prevent overfitting. This change contributes to a more robust and reliable DQN implementation.
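As a minimal sketch of the change, only the optimizer construction differs from the base model (the layer here is illustrative):

```python
import torch.nn as nn
import torch.optim as optim

net = nn.Linear(3, 100)

# Base model: plain Adam, no regularization (weight_decay defaults to 0)
base_opt = optim.Adam(net.parameters(), lr=1e-3)

# This variant: Adam with weight decay. PyTorch's Adam implements it as
# coupled L2 regularization, adding weight_decay * w to each parameter's
# gradient before the Adam update, nudging weights toward zero every step
reg_opt = optim.Adam(net.parameters(), lr=1e-3, weight_decay=1e-5)
```

Note that this coupled form differs from the decoupled decay used by AdamW; at a strength as small as 1e-5 the practical effect is a gentle shrinkage of the weights.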
import os
os.environ["KMP_DUPLICATE_LIB_OK"]="TRUE"
# Hyperparameters for DQN algorithm
class Args:
gamma = 0.9 # discount factor
num_actions = 5
seed = 0
render = False
log_interval = 10
args = Args()
torch.manual_seed(args.seed)
np.random.seed(args.seed)
# Store the training records (episode number and reward)
TrainingRecord = namedtuple('TrainingRecord', ['ep', 'reward'])
# Store the transitions -> state (s), action (a), reward (r), next state (s_)
Transition = namedtuple('Transition', ['s', 'a', 'r', 's_'])
# Defines the neural network used by the DQN agent: a dueling architecture with a tanh hidden layer and linear output heads
class Net(nn.Module):
def __init__(self):
# initialize the base class for pytorch
super(Net, self).__init__()
'''
dense fully connected layer with input size of 3 and output size of 100. Initial feature extractor from the
input state vector, in this case (Pendulum-v0) represents the continuous state space.
'''
self.fc = nn.Linear(3, 100)
'''
This layer maps the 100 dimension feature representation to a space representing the action values (Q-values) for
each possible action
'''
self.a_head = nn.Linear(100, args.num_actions)
'''
Another layer that also maps the 100 dimension feature representation to a single value (V). This is part of the dueling DQN
architecture concept, where the network separately estimates state values and advantage values.
'''
self.v_head = nn.Linear(100, 1)
# Forward pass combines state and action values to produce Q-values for Dueling DQN.
def forward(self, x):
# Using the tanh activation function
x = torch.tanh(self.fc(x))
a = self.a_head(x) - self.a_head(x).mean(1, keepdim=True)
v = self.v_head(x)
# Computes state value (V) with advantage value (a) for each action to produce final Q-values.
action_scores = a + v
return action_scores
'''
This class stores experience transitions encountered by the agent during training. This mechanism
is a core component of the experience replay technique used in DQN networks to stabilize and improve the learning process.
'''
class Memory():
data_pointer = 0 # keeps track of current index in the memory buffer where the next transition will be stored.
isfull = False # flag indicating whether memory buffer has been filled up at least once. Important to know whether memory buffer contains enough samples to start sampling or not
# initialize the memory buffer
def __init__(self, capacity):
self.memory = np.empty(capacity, dtype=object)
self.capacity = capacity
# method to add a new transition to the memory buffer
def update(self, transition):
# new transition is stored in the memory at the data pointer index, which was initialized above, starting from 0
self.memory[self.data_pointer] = transition
# increment the data pointer to next location in memory for the next transition
self.data_pointer += 1
# if the buffer is filled, the flag will indicate TRUE and the future updates will start overwriting the old transitions, as pointer will reset to 0
if self.data_pointer == self.capacity:
self.data_pointer = 0
self.isfull = True
# this method randomly samples a batch of transitions from the memory, and batch_size determines the size of the batch.
def sample(self, batch_size):
return np.random.choice(self.memory, batch_size)
'''
This Agent class performs the behaviour and learning algorithm of the DQN model (agent). This class has the methods
for action selection, memory storage, parameter saving, and the learning algorithm of the dueling DQN network. It initializes
the 2 neural networks (eval_net and target_net) for calculating the Q-values, and an optimizer for training the eval_net.
'''
class Agent():
'''
This action_list generates a list of actions the agent can take. Since a DQN model is meant for discrete action spaces, but the Pendulum-v0 is
a continuous action space, we will need to convert the action space to discrete. We can do that by pre-defining the actions the model can take, such
as an increment of values from -2 to 2. This allows the model to choose a specific number.
'''
action_list = [(i * 4 / (args.num_actions - 1) - 2,) for i in range(args.num_actions)]  # evenly spaced torques from -2 to 2
max_grad_norm = 0.5
def __init__(self):
self.training_step = 0
# initial exploration rate is 1, which decays over time to encourage exploration at the start and exploitation of the learned policy later.
self.epsilon = 1
# the models
self.eval_net, self.target_net = Net().float(), Net().float()
# memory
self.memory = Memory(2000)
# optimizer
self.optimizer = optim.Adam(self.eval_net.parameters(), lr=1e-3, weight_decay=1e-5)
'''
The select_action method uses the epsilon-greedy policy for action selection. With probability epsilon, it
selects a random action (exploration); with probability 1 - epsilon, it selects the action with the highest Q-value as predicted by the eval_net (exploitation).
'''
def select_action(self, state):
# Convert state from numpy array to pytorch tensor and process it to get action values
state = torch.from_numpy(state).float().unsqueeze(0)
if np.random.random() < self.epsilon:
action_index = np.random.randint(args.num_actions)
else:
# Exploitation: choose the action with the highest Q-value predicted by the eval_net
probs = self.eval_net(state)
action_index = probs.max(1)[1].item()
return self.action_list[action_index], action_index
# save parameters
def save_param(self):
if not os.path.exists('Weights/DuelingDQN/weightDecay'):
os.makedirs('Weights/DuelingDQN/weightDecay')
torch.save(self.eval_net.state_dict(), 'Weights/DuelingDQN/weightDecay/dqn_net_params.pkl')
# store transition in the memory buffer. Contains the s, a, r, s_ for the model to learn
def store_transition(self, transition):
self.memory.update(transition)
'''
This update method performs a single update step on the eval_net model, using a random batch of transitions from the memory. This
method will
1. Sample a batch and convert them to pytorch tensors
2. Use a double dqn method to decouple selection and evaluation of actions,
which will help reduce overestimations of Q-values regularly seen in natural DQNs
3. Calculate the loss between predicted and target Q-values, using the smooth L1 (Huber) loss
as the objective function
4. Perform backpropagation and optimization to update weights of eval_net
5. Update target_net every 200 steps
6. Decay exploration rate (epsilon) to slowly shift from exploration to exploitation
'''
def update(self):
self.training_step += 1
# take a batch from memory
transitions = self.memory.sample(32)
s = torch.tensor([t.s for t in transitions], dtype=torch.float)
a = torch.tensor([t.a for t in transitions], dtype=torch.long).view(-1, 1)
r = torch.tensor([t.r for t in transitions], dtype=torch.float).view(-1, 1)
s_ = torch.tensor([t.s_ for t in transitions], dtype=torch.float)
# natural dqn
# q_eval = self.eval_net(s).gather(1, a)
# with torch.no_grad():
# q_target = r + args.gamma * self.target_net(s_).max(1, keepdim=True)[0]
# double dqn method, to reduce overestimations of Q-values
with torch.no_grad():
a_ = self.eval_net(s_).max(1, keepdim=True)[1]
q_target = r + args.gamma * self.target_net(s_).gather(1, a_)
q_eval = self.eval_net(s).gather(1, a)
# optimizer
self.optimizer.zero_grad()
# smooth L1 (Huber) loss between predicted and target Q-values
loss = F.smooth_l1_loss(q_eval, q_target)
loss.backward()
nn.utils.clip_grad_norm_(self.eval_net.parameters(), self.max_grad_norm)
self.optimizer.step()
# update the target_net every 200 steps to synchronise model with eval_net
if self.training_step % 200 == 0:
self.target_net.load_state_dict(self.eval_net.state_dict())
# epsilon decay rate to move from exploration to exploitation
self.epsilon = max(self.epsilon * 0.999, 0.01)
return q_eval.mean().item()
# Main function to call all our above functions for the model to run
def main():
env = gym.make('Pendulum-v0')
env.seed(args.seed)
agent = Agent()
episodes, rewards = [], [] # Initialize empty lists to store episodes and rewards
running_reward, running_q = -1000, 0
for i_ep in range(100):
score = 0
state = env.reset()
# Store the Frames
frames = []
for t in range(200):
# Render the environment and capture frames
frames.append(env.render(mode='rgb_array'))
action, action_index = agent.select_action(state)
state_, reward, done, _ = env.step(action)
score += reward
if args.render:
env.render()
agent.store_transition(Transition(state, action_index, (reward + 8) / 8, state_))
state = state_
if agent.memory.isfull:
q = agent.update()
running_q = 0.99 * running_q + 0.01 * q
running_reward = running_reward * 0.9 + score * 0.1
episodes.append(i_ep)
rewards.append(running_reward)
if i_ep % args.log_interval == 0:
print('Ep {}\tAverage score: {:.2f}\tAverage Q: {:.2f}'.format(i_ep, running_reward, running_q))
if running_reward > -200:
print("Solved! Running reward is now {}!".format(running_reward))
agent.save_param()
break
# Directory where you want to save the files
save_dir = 'training_animations/DuelingDQN/weightDecay'
# Create the directory if it doesn't exist
if not os.path.exists(save_dir):
os.makedirs(save_dir)
# Save frames as GIF
gif_filename = os.path.join(save_dir, f'episode_{i_ep+1}.gif')
imageio.mimsave(gif_filename, frames, duration=0.00333, loop=0) # Adjust duration as needed
env.close()
return episodes, rewards
episodes, rewards = main()
Ep 0 Average score: -1066.74 Average Q: 0.00
Ep 10 Average score: -1273.12 Average Q: 0.37
Ep 20 Average score: -1233.21 Average Q: 2.08
Ep 30 Average score: -644.33 Average Q: 6.98
Ep 40 Average score: -484.88 Average Q: 7.60
Ep 50 Average score: -266.06 Average Q: 8.97
Ep 60 Average score: -242.08 Average Q: 8.79
Solved! Running reward is now -196.55193511670265!
plot_rewards(rewards)
compute_moving_average_and_plot(rewards)
GIF of highest reward attempt¶
Results¶
After implementing weight decay, the model exhibits similar trends to the previous model, with rewards steadily increasing and eventually stabilizing as training progresses. This consistent rise in rewards suggests that the model is effectively learning and improving its performance.
Additionally, the model demonstrates a strong ability to balance itself, as evidenced by the high rewards achieved. The accompanying GIF shows the model successfully maintaining the pendulum in an upright position, indicating that it has learned an optimal control policy.
In terms of stability, the implementation of weight decay has had a positive impact. The rewards graph shows significantly fewer spikes compared to the previous model, indicating a smoother and more consistent learning process. This reduction in fluctuations suggests that weight decay has helped regularize the model, mitigating the risk of overfitting and leading to more stable training outcomes.
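The mechanism behind this regularizing effect can be seen in a minimal sketch of the underlying update rule (a plain SGD-style illustration of L2 weight decay, not the Adam variant used in the training code; all constants here are assumptions for illustration):

```python
import math

def final_weight_norm(weight_decay, steps=200, lr=0.1, dim=5):
    # Weight decay (L2 regularization) adds wd * w to each gradient:
    #   w <- w - lr * (grad + wd * w)
    # With a zero task gradient, any shrinkage of the weights
    # comes from the decay term alone.
    w = [1.0] * dim
    for _ in range(steps):
        task_grad = [0.0] * dim  # isolate the decay term
        w = [wi - lr * (gi + weight_decay * wi) for wi, gi in zip(w, task_grad)]
    return math.sqrt(sum(wi * wi for wi in w))

print(final_weight_norm(0.0))   # no decay: norm stays at sqrt(5)
print(final_weight_norm(1e-2))  # with decay: norm shrinks toward zero
```

Keeping the weights small in this way discourages extreme Q-value estimates, which is consistent with the smoother reward curve observed above.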
Boltzmann Exploration Policy ¶
What is the Boltzmann Exploration Policy¶
The Boltzmann exploration policy is a strategy used in reinforcement learning. It converts the values assigned to each possible action (such as Q-values) into a probability distribution using a softmax function. The action to be taken is then sampled from this distribution. This method allows for a balance between exploration and exploitation by adjusting the steepness of the softmax function through a temperature parameter, which can be scheduled over time.
How the Boltzmann Exploration Policy works¶
The Boltzmann exploration policy works by converting the estimated values of different actions into a probability distribution.
The process begins by assigning a value to each possible action, typically based on the expected reward. These values are then transformed into probabilities using a mathematical function called softmax.
A key component of this policy is the temperature parameter, which controls the balance between exploration and exploitation.
When the temperature is high, the probabilities of choosing different actions become more similar, encouraging exploration. Conversely, when the temperature is low, actions with higher estimated values are more likely to be chosen, favoring exploitation. The policy then selects an action based on these calculated probabilities, introducing an element of randomness to the decision-making process. This approach allows the agent to systematically explore different actions while gradually focusing on those that appear most rewarding.
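The temperature's effect can be sketched directly (a minimal stdlib-only illustration of the softmax step described above; the Q-values are made up for the example):

```python
import math
import random

def boltzmann_probs(q_values, temperature):
    # Softmax with a temperature; subtracting the max keeps exp() numerically stable.
    m = max(q_values)
    exps = [math.exp((q - m) / temperature) for q in q_values]
    total = sum(exps)
    return [e / total for e in exps]

q = [1.0, 2.0, 3.0]                           # hypothetical Q-values
hot = boltzmann_probs(q, temperature=10.0)    # near-uniform -> exploration
cold = boltzmann_probs(q, temperature=0.1)    # near-greedy  -> exploitation

# The action index is then sampled from the resulting distribution
random.seed(0)
action_index = random.choices(range(len(q)), weights=cold)[0]
```

With a high temperature the three probabilities are almost equal, while with a low temperature nearly all the mass sits on the highest-valued action, so annealing the temperature shifts the agent from exploration to exploitation.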
How Boltzmann Exploration Helps with Stability¶
Boltzmann exploration allows for a smooth transition between exploration and exploitation. Unlike ε-greedy, where exploration is done with a fixed probability, Boltzmann exploration adjusts the exploration based on the Q-values, which helps prevent the agent from becoming stuck in suboptimal policies due to excessive or insufficient exploration.
As the Q-values are updated and become more accurate, the Boltzmann distribution adjusts the probabilities of selecting actions accordingly. This adaptiveness helps in stabilizing the learning process as the policy evolves over time.
Boltzmann Exploration Policy vs. Epsilon-Greedy Policy¶
Action selection mechanism¶
- Epsilon-greedy: Chooses the best-known action with probability 1-ε, and a random action with probability ε.
- Boltzmann: Assigns probabilities to all actions based on their estimated values, using a softmax function.
Exploration approach¶
- Epsilon-greedy: Exploration is purely random when it occurs.
- Boltzmann: Exploration is guided by the relative values of actions, favoring more promising options.
Parameter control¶
- Epsilon-greedy: Controlled by a single parameter ε, which directly represents the exploration probability.
- Boltzmann: Controlled by a temperature parameter τ, which indirectly affects exploration by altering action probabilities.
Transition to exploitation¶
- Epsilon-greedy: Often involves gradually decreasing ε over time.
- Boltzmann: Involves decreasing the temperature parameter τ over time.
# Hyperparameters for DQN algorithm
class Args:
gamma = 0.9 # discount factor
num_actions = 5
seed = 0
render = False
log_interval = 10
args = Args()
torch.manual_seed(args.seed)
np.random.seed(args.seed)
# Store the training records (episode number and reward)
TrainingRecord = namedtuple('TrainingRecord', ['ep', 'reward'])
# Store the transitions -> state (s), action (a), reward (r), next state (s_)
Transition = namedtuple('Transition', ['s', 'a', 'r', 's_'])
# Defines the neural network used by the dueling DQN agent: a shared dense layer feeding separate advantage and value heads
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
self.fc = nn.Linear(3, 100)
self.a_head = nn.Linear(100, args.num_actions)
self.v_head = nn.Linear(100, 1)
def forward(self, x):
x = F.leaky_relu(self.fc(x), negative_slope=0.01) # Using LeakyReLU
a = self.a_head(x) - self.a_head(x).mean(1, keepdim=True)
v = self.v_head(x)
action_scores = a + v
return action_scores
'''
This class stores experience transitions encountered by the agent during training. This mechanism
is a core component of the experience replay technique used in DQN networks to stabilize and improve the learning process.
'''
class Memory():
data_pointer = 0 # keeps track of current index in the memory buffer where the next transition will be stored.
isfull = False # flag indicating whether memory buffer has been filled up at least once. Important to know whether memory buffer contains enough samples to start sampling or not
# initialize the memory buffer
def __init__(self, capacity):
self.memory = np.empty(capacity, dtype=object)
self.capacity = capacity
# method to add a new transition to the memory buffer
def update(self, transition):
# new transition is stored in the memory at the data pointer index, which was initialized above, starting from 0
self.memory[self.data_pointer] = transition
# increment the data pointer to next location in memory for the next transition
self.data_pointer += 1
# if the buffer is filled, the flag is set to True and future updates will start overwriting the old transitions, as the pointer resets to 0
if self.data_pointer == self.capacity:
self.data_pointer = 0
self.isfull = True
# this method randomly samples a batch of transitions from the memory, and batch_size determines the size of the batch.
def sample(self, batch_size):
return np.random.choice(self.memory, batch_size)
'''
This Agent class performs the behaviour and learning algorithm of the DQN model (agent). This class has the methods
for action selection, memory storage, parameter saving, and the learning algorithm of the dueling DQN network. It initializes
the 2 neural networks (eval_net and target_net) for calculating the Q-values, and an optimizer for training the eval_net.
'''
class Agent():
'''
This action_list generates the list of actions the agent can take. A DQN model is meant for discrete action spaces, but Pendulum-v0
has a continuous action space, so we need to convert the action space to discrete. We can do that by pre-defining the actions the model
can take, such as a fixed set of torque values starting from -2. This allows the model to choose a specific number.
'''
action_list = [(i * 4 - 2,) for i in range(args.num_actions)]
max_grad_norm = 0.5
def __init__(self):
self.training_step = 0
# initial exploration rate is 1, which decays over time to encourage exploration at the start and exploitation of the learned policy later.
self.epsilon = 1
# the models
self.eval_net, self.target_net = Net().float(), Net().float()
# memory
self.memory = Memory(2000)
# optimizer
self.optimizer = optim.Adam(self.eval_net.parameters(), lr=1e-3, weight_decay=1e-5) # Added weight_decay for L2 regularization
# Initialize temperature for Boltzmann exploration
self.temperature = 1.0
'''
Boltzmann exploration method for action selection. Q-values computed are transformed into a probability distribution
using a softmax function. The temperature parameter controls the stochasticity of the action selection process.
When the temperature is high, the action probabilities become more uniform, increasing exploration.
When the temperature is low, the action probabilities become more skewed towards the action with the highest Q-value, increasing exploitation.
'''
def select_action(self, state):
state = torch.from_numpy(state).float().unsqueeze(0)
q_values = self.eval_net(state).detach() # Get Q-values for all actions
probabilities = F.softmax(q_values / self.temperature, dim=-1).numpy() # Apply softmax to convert Q-values to probabilities
action_index = np.random.choice(np.arange(args.num_actions), p=probabilities.ravel()) # Choose action based on probabilities
return self.action_list[action_index], action_index
# save parameters
def save_param(self):
if not os.path.exists('Weights/DuelingDQN/Boltzzman'):
os.makedirs('Weights/DuelingDQN/Boltzzman')
torch.save(self.eval_net.state_dict(), 'Weights/DuelingDQN/Boltzzman/dqn_improved_params.pkl')
# store transition in the memory buffer. Contains the s, a, r, s_ for the model to learn
def store_transition(self, transition):
self.memory.update(transition)
'''
This update method performs a single update step on the eval_net model, using a random batch of transitions from the memory. This
method will
1. Sample a batch and convert them to pytorch tensors
2. Use a double dqn method to decouple selection and evaluation of actions,
which will help reduce overestimations of Q-values regularly seen in natural DQNs
3. Calculate the loss between predicted and target Q-values, using the smooth L1 (Huber) loss
as the objective function
4. Perform backpropagation and optimization to update weights of eval_net
5. Update target_net every 200 steps
6. Decay the exploration parameters (epsilon and the Boltzmann temperature) to slowly shift from exploration to exploitation
'''
def update(self):
self.training_step += 1
# take a batch from memory
transitions = self.memory.sample(32)
s = torch.tensor([t.s for t in transitions], dtype=torch.float)
a = torch.tensor([t.a for t in transitions], dtype=torch.long).view(-1, 1)
r = torch.tensor([t.r for t in transitions], dtype=torch.float).view(-1, 1)
s_ = torch.tensor([t.s_ for t in transitions], dtype=torch.float)
# natural dqn
# q_eval = self.eval_net(s).gather(1, a)
# with torch.no_grad():
# q_target = r + args.gamma * self.target_net(s_).max(1, keepdim=True)[0]
# double dqn method, to reduce overestimations of Q-values
with torch.no_grad():
a_ = self.eval_net(s_).max(1, keepdim=True)[1]
q_target = r + args.gamma * self.target_net(s_).gather(1, a_)
q_eval = self.eval_net(s).gather(1, a)
# optimizer
self.optimizer.zero_grad()
# smooth L1 (Huber) loss between predicted and target Q-values
loss = F.smooth_l1_loss(q_eval, q_target)
loss.backward()
nn.utils.clip_grad_norm_(self.eval_net.parameters(), self.max_grad_norm)
self.optimizer.step()
# update the target_net every 200 steps to synchronise model with eval_net
if self.training_step % 200 == 0:
self.target_net.load_state_dict(self.eval_net.state_dict())
# epsilon decay kept from the epsilon-greedy version (the Boltzmann select_action above does not use it)
self.epsilon = max(self.epsilon * 0.999, 0.01)
self.temperature = max(self.temperature * 0.995, 0.01) # Decrease temperature each episode or step
return q_eval.mean().item()
# Main function to call all our above functions for the model to run
def main():
env = gym.make('Pendulum-v0')
env.seed(args.seed)
agent = Agent()
episodes, rewards = [], [] # Initialize empty lists to store episodes and rewards
running_reward, running_q = -1000, 0
for i_ep in range(100):
score = 0
state = env.reset()
# Save the frames
frames = []
for t in range(200):
# Render the environment and capture frames
frames.append(env.render(mode='rgb_array'))
action, action_index = agent.select_action(state)
state_, reward, done, _ = env.step(action)
score += reward
if args.render:
env.render()
agent.store_transition(Transition(state, action_index, (reward + 8) / 8, state_))
state = state_
if agent.memory.isfull:
q = agent.update()
running_q = 0.99 * running_q + 0.01 * q
running_reward = running_reward * 0.9 + score * 0.1
episodes.append(i_ep)
rewards.append(running_reward)
if i_ep % args.log_interval == 0:
print('Ep {}\tAverage score: {:.2f}\tAverage Q: {:.2f}'.format(i_ep, running_reward, running_q))
if running_reward > -10:
print("Solved! Running reward is now {}!".format(running_reward))
break
# Directory where you want to save the files
save_dir = 'training_animations/DuelingDQN/BotlzmannExploration'
# Create the directory if it doesn't exist
if not os.path.exists(save_dir):
os.makedirs(save_dir)
# Save frames as GIF
gif_filename = os.path.join(save_dir, f'episode_{i_ep+1}.gif')
imageio.mimsave(gif_filename, frames, duration=0.00333, loop=0) # Adjust duration as needed
env.close()
agent.save_param()
return episodes, rewards # Return the collected episodes and rewards
# Ensure this line is at the end of the script to call main and collect data
episodes, rewards = main()
Ep 0 Average score: -1060.49 Average Q: 0.00
Ep 10 Average score: -1288.19 Average Q: 0.32
Ep 20 Average score: -1067.92 Average Q: 2.67
Ep 30 Average score: -700.46 Average Q: 6.01
Ep 40 Average score: -451.28 Average Q: 7.86
Ep 50 Average score: -260.35 Average Q: 9.09
Ep 60 Average score: -202.53 Average Q: 9.10
Ep 70 Average score: -170.75 Average Q: 9.30
Ep 80 Average score: -169.74 Average Q: 9.32
Ep 90 Average score: -170.78 Average Q: 9.10
plot_rewards(rewards)
compute_moving_average_and_plot(rewards)
GIF of highest reward attempt¶
Results¶
As seen from the GIF, the model is able to balance the pendulum effectively. When the pendulum reaches the upright position, the model consistently maintains its balance, demonstrating its ability to apply the correct amount of torque (Action) to keep the pendulum steady.
In terms of stability after implementing the Boltzmann exploration policy, we can observe that while there are still spikes in the reward and moving average curves, these spikes are less frequent. This indicates a smoother and more stable learning process compared to earlier configurations.
In terms of rewards, the model reaches a running reward of roughly -170 and successfully balances the pendulum. The lower reward may be attributed to the initial starting position of the pendulum, which begins from a low position. This starting condition can negatively impact the initial reward, as the pendulum has to perform additional work to reach and maintain the balanced position.
Running the Dueling DQN weights multiple times (Test the Dueling DQN Model) ¶
class Net(nn.Module):
def __init__(self):
super(Net, self).__init__()
self.fc = nn.Linear(3, 100)
self.a_head = nn.Linear(100, args.num_actions)
self.v_head = nn.Linear(100, 1)
def forward(self, x):
x = F.leaky_relu(self.fc(x), negative_slope=0.01) # Using LeakyReLU
a = self.a_head(x) - self.a_head(x).mean(1, keepdim=True)
v = self.v_head(x)
action_scores = a + v
return action_scores
import torch
import gym
import numpy as np
class Args:
num_actions = 5 # Update this if necessary
args = Args()
model = Net()
model.load_state_dict(torch.load('Weights/DuelingDQN/Boltzzman/dqn_improved_params.pkl'))
model.eval() # Set the model to evaluation mode
def test_model(env_name='Pendulum-v0', num_episodes=20, render=True):
env = gym.make(env_name)
env.seed(123)
total_rewards = []
for episode in range(num_episodes):
state = env.reset()
episode_reward = 0
# Save the frames
frames = []
while True:
# Render the environment and capture frames
frames.append(env.render(mode='rgb_array'))
state_tensor = torch.from_numpy(state).float().unsqueeze(0)
with torch.no_grad():
action_index = model(state_tensor).max(1)[1].view(1, 1)
# Convert the action index to the actual action using your mapping
action = [(action_index.item() * 4 / (args.num_actions - 1)) - 2] # Adjust this based on your action mapping
next_state, reward, done, _ = env.step(action)
episode_reward += reward
if render:
env.render()
if done:
print(f"Episode: {episode + 1}, Total Reward: {episode_reward:.2f}")
total_rewards.append(episode_reward)
break
state = next_state
# Directory where you want to save the files
save_dir = 'test_animations/DuelingDQN/Best'
# Create the directory if it doesn't exist
if not os.path.exists(save_dir):
os.makedirs(save_dir)
# Save frames as GIF
gif_filename = os.path.join(save_dir, f'episode_{episode+1}.gif')
imageio.mimsave(gif_filename, frames, duration=0.00333, loop=0) # Adjust duration as needed
env.close()
print(f"Average Reward: {np.mean(total_rewards):.2f}")
return total_rewards
rewards = test_model(num_episodes=20,render=True)
Episode: 1, Total Reward: -251.23
Episode: 2, Total Reward: -127.94
Episode: 3, Total Reward: -931.68
Episode: 4, Total Reward: -932.81
Episode: 5, Total Reward: -244.43
Episode: 6, Total Reward: -128.74
Episode: 7, Total Reward: -248.52
Episode: 8, Total Reward: -252.76
Episode: 9, Total Reward: -246.23
Episode: 10, Total Reward: -623.10
Episode: 11, Total Reward: -749.77
Episode: 12, Total Reward: -6.21
Episode: 13, Total Reward: -130.12
Episode: 14, Total Reward: -130.58
Episode: 15, Total Reward: -1493.68
Episode: 16, Total Reward: -6.49
Episode: 17, Total Reward: -6.23
Episode: 18, Total Reward: -6.23
Episode: 19, Total Reward: -615.80
Episode: 20, Total Reward: -489.70
Average Reward: -381.11
plot_rewards(rewards)
display_gifs_in_grid('test_animations/DuelingDQN/Best')
Results¶
The Dueling DQN successfully balances the pendulum in 18 out of 20 trials, with episode rewards ranging from about -6 to about -1500. This indicates that the model is generally reliable in maintaining balance. However, the wide range of rewards suggests that the model has not fully mastered every starting condition, and further refinement could enhance its ability to handle such scenarios more consistently.
Deep Deterministic Policy Gradient (DDPG) ¶
What is DDPG¶
Deep Deterministic Policy Gradient (DDPG) is a reinforcement learning algorithm tailored for continuous action spaces, such as those found in the pendulum problem (Pendulum-v0). Instead of generating probability distributions over actions like stochastic policies, DDPG utilizes a deterministic policy that directly outputs specific actions. This method generally involves deep neural networks to approximate both the policy function and the action-value function. By leveraging the Deterministic Policy Gradient theorem, the policy parameters are updated to increase the expected action-value.
How DDPG Works¶
The algorithm employs two neural networks: one to approximate the policy function, which takes observed states as inputs and produces corresponding actions, and another to estimate the action-value function, evaluating the quality of selected actions in given states.
DDPG leverages the Deterministic Policy Gradient theorem to update policy parameters, providing a gradient expression for adjusting the policy network's weights towards higher expected action-value. To balance exploration and exploitation, the algorithm adds noise to selected actions during the exploration phase.
Training involves using optimization methods like stochastic gradient descent to update both networks' parameters based on estimated gradients. The action-value function is refined to better approximate expected cumulative rewards, while the policy network is updated to generate actions that maximize the action-value. DDPG's approach of directly outputting deterministic actions offers advantages in continuous domains and tends to have lower variance in gradient estimates compared to stochastic policies.
The Actor Critic Architecture¶
DDPG employs an actor-critic framework where the actor and critic networks work together to optimize the policy and value functions.
Actor¶
The actor is responsible for determining the actions to take given the current state. It represents a deterministic policy function, typically approximated by a deep neural network. This network takes the observed state as input and outputs a specific action directly.
Critic¶
The critic evaluates the actions selected by the actor by estimating the action-value function Q(s,a). This function is also approximated by a deep neural network and is trained to predict the expected cumulative reward for taking a given action in a given state.
Training Process of DDPG¶
The training process of Deep Deterministic Policy Gradient (DDPG) is an iterative procedure that integrates exploration, experience collection, and network updates. The agent initiates interaction with the environment by employing the OU process to introduce temporally correlated noise to the actor's output, thereby promoting exploration within the continuous action space. As the agent interacts, it stores experiences (state, action, reward, next state) in the replay buffer.
During training, the agent samples mini-batches of experiences from the replay buffer to perform network updates. These experiences are used to train the critic network by minimizing the loss between the predicted Q-values and the target Q-values. The target Q-values are computed using the target critic network and the Bellman equation, which helps provide stable targets for the Q-value updates.
Simultaneously, the actor network is updated using the deterministic policy gradient. This gradient is calculated based on the Q-values predicted by the critic network for the actions selected by the actor. The actor’s policy is adjusted to maximize these Q-values, thereby enhancing its performance. The target networks are slowly updated from the main networks through a soft update mechanism, ensuring stability in the training process.
What is Ornstein-Uhlenbeck Action Noise¶
Ornstein-Uhlenbeck (OU) action noise is a type of noise used in reinforcement learning, particularly in algorithms dealing with continuous action spaces, such as the Deep Deterministic Policy Gradient (DDPG). It is used to encourage exploration of the action space by adding noise to the actions taken by the agent.
The OU process generates temporally correlated noise, which is useful in environments where the optimal policy requires smooth and continuous actions.
How Ornstein-Uhlenbeck Action Noise works¶
The OU process facilitates exploration in continuous action spaces by introducing structured randomness to the agent's actions, allowing it to systematically explore the action space while maintaining a degree of consistency in its behaviour. This approach is more efficient than randomly sampling actions from the entire action space, which can be ineffective in high-dimensional continuous spaces.
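This temporal correlation can be checked with a small numerical sketch (a stdlib-only discretisation of the OU step; theta, sigma, and dt are illustrative values, not taken from this notebook's code):

```python
import math
import random

random.seed(0)

def ou_path(theta=0.15, sigma=0.2, dt=1e-2, n=5000):
    # Discretised OU step, with mean mu = 0:
    #   x_t = x_{t-1} + theta * (mu - x_{t-1}) * dt + sigma * sqrt(dt) * N(0, 1)
    x = [0.0]
    for _ in range(n - 1):
        x.append(x[-1] + theta * (0.0 - x[-1]) * dt
                 + sigma * math.sqrt(dt) * random.gauss(0.0, 1.0))
    return x

def lag1_autocorr(xs):
    # Correlation between each sample and its successor
    mean = sum(xs) / len(xs)
    c = [v - mean for v in xs]
    return sum(a * b for a, b in zip(c, c[1:])) / sum(v * v for v in c)

ou = ou_path()                                          # strongly correlated step to step
white = [random.gauss(0.0, 1.0) for _ in range(5000)]   # uncorrelated baseline
```

The OU path's lag-1 autocorrelation is close to 1 while the white-noise baseline's is close to 0, which is why OU noise yields the smooth, persistent action perturbations useful for torque-like controls.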
Exploration VS Exploitation¶
The OU process supports the balance between exploration and exploitation by gradually reducing the level of randomness as the agent's policy becomes more refined. Initially, the structured noise encourages extensive exploration of the action space, allowing the agent to discover and evaluate a variety of actions. As learning progresses and the agent gains more experience, the noise diminishes, enabling the agent to focus more on exploiting the actions that have proven effective. Thus, the OU process ensures that exploration is conducted in a systematic manner while progressively shifting the emphasis towards exploiting the learned policy.
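The OUActionNoise class below keeps its std_dev fixed; one common way to realise the decay described here is an external annealing schedule. The sketch below is hypothetical (the constants are assumptions, not values from this notebook):

```python
def annealed_sigma(episode, sigma0=0.2, decay=0.99, min_sigma=0.01):
    # Exponentially shrink the noise scale per episode, with a floor so
    # the agent never stops exploring entirely.
    return max(sigma0 * decay ** episode, min_sigma)

schedule = [annealed_sigma(ep) for ep in range(300)]
```

Each episode, the OU noise object would be rebuilt (or its std_dev reassigned) with annealed_sigma(ep), so early episodes explore widely and later ones mostly exploit the learned policy.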
References
https://huggingface.co/keras-io/deep-deterministic-policy-gradient
The DDPG Model ¶
problem = "Pendulum-v0"
env = gym.make(problem)
num_states = env.observation_space.shape[0]
print("Size of State Space -> {}".format(num_states))
num_actions = env.action_space.shape[0]
print("Size of Action Space -> {}".format(num_actions))
upper_bound = env.action_space.high[0]
lower_bound = env.action_space.low[0]
print("Max Value of Action -> {}".format(upper_bound))
print("Min Value of Action -> {}".format(lower_bound))
Size of State Space -> 3
Size of Action Space -> 1
Max Value of Action -> 2.0
Min Value of Action -> -2.0
class OUActionNoise:
def __init__(self, mean, std_deviation, theta=0.15, dt=1e-2, x_initial=None):
self.theta = theta
self.mean = mean
self.std_dev = std_deviation
self.dt = dt
self.x_initial = x_initial
self.reset()
def __call__(self):
x = (
self.x_prev
+ self.theta * (self.mean - self.x_prev) * self.dt
+ self.std_dev * np.sqrt(self.dt) * np.random.normal(size=self.mean.shape)
)
# Store x into x_prev
# Makes next noise dependent on current one
self.x_prev = x
return x
def reset(self):
if self.x_initial is not None:
self.x_prev = self.x_initial
else:
self.x_prev = np.zeros_like(self.mean)
class Buffer:
def __init__(self, buffer_capacity=100000, batch_size=64):
# Number of "experiences" to store at max
self.buffer_capacity = buffer_capacity
# Num of tuples to train on.
self.batch_size = batch_size
# It tells us the number of times record() was called.
self.buffer_counter = 0
# Instead of a list of tuples, as in the classic experience replay formulation,
# we use a separate np.array for each tuple element
self.state_buffer = np.zeros((self.buffer_capacity, num_states))
self.action_buffer = np.zeros((self.buffer_capacity, num_actions))
self.reward_buffer = np.zeros((self.buffer_capacity, 1))
self.next_state_buffer = np.zeros((self.buffer_capacity, num_states))
# Takes a (s,a,r,s') observation tuple as input
def record(self, obs_tuple):
# Set index to zero if buffer_capacity is exceeded,
# replacing old records
index = self.buffer_counter % self.buffer_capacity
self.state_buffer[index] = obs_tuple[0]
self.action_buffer[index] = obs_tuple[1]
self.reward_buffer[index] = obs_tuple[2]
self.next_state_buffer[index] = obs_tuple[3]
self.buffer_counter += 1
# Eager execution is turned on by default in TensorFlow 2. Decorating with tf.function allows
# TensorFlow to build a static graph out of the logic and computations in our function.
# This provides a large speed up for blocks of code that contain many small TensorFlow operations such as this one.
@tf.function
def update(
self, state_batch, action_batch, reward_batch, next_state_batch,
):
# Training and updating Actor & Critic networks.
# See Pseudo Code.
with tf.GradientTape() as tape:
target_actions = target_actor(next_state_batch, training=True)
y = reward_batch + gamma * target_critic(
[next_state_batch, target_actions], training=True
)
critic_value = critic_model([state_batch, action_batch], training=True)
critic_loss = tf.math.reduce_mean(tf.math.square(y - critic_value))
critic_grad = tape.gradient(critic_loss, critic_model.trainable_variables)
critic_optimizer.apply_gradients(
zip(critic_grad, critic_model.trainable_variables)
)
with tf.GradientTape() as tape:
actions = actor_model(state_batch, training=True)
critic_value = critic_model([state_batch, actions], training=True)
# Used `-value` as we want to maximize the value given
# by the critic for our actions
actor_loss = -tf.math.reduce_mean(critic_value)
actor_grad = tape.gradient(actor_loss, actor_model.trainable_variables)
actor_optimizer.apply_gradients(
zip(actor_grad, actor_model.trainable_variables)
)
# We compute the loss and update parameters
def learn(self):
# Get sampling range
record_range = min(self.buffer_counter, self.buffer_capacity)
# Randomly sample indices
batch_indices = np.random.choice(record_range, self.batch_size)
# Convert to tensors
state_batch = tf.convert_to_tensor(self.state_buffer[batch_indices])
action_batch = tf.convert_to_tensor(self.action_buffer[batch_indices])
reward_batch = tf.convert_to_tensor(self.reward_buffer[batch_indices])
reward_batch = tf.cast(reward_batch, dtype=tf.float32)
next_state_batch = tf.convert_to_tensor(self.next_state_buffer[batch_indices])
self.update(state_batch, action_batch, reward_batch, next_state_batch)
# This slowly updates the target network parameters,
# based on rate `tau`, which is much less than one.
@tf.function
def update_target(target_weights, weights, tau):
for (a, b) in zip(target_weights, weights):
a.assign(b * tau + a * (1 - tau))
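To see what a single soft update does numerically, here is a standalone NumPy sketch (not part of the training code): with tau = 0.005 the target weights move only 0.5% of the way toward the online weights per step, which is why convergence of the target network is slow and smooth.

```python
import numpy as np

tau = 0.005
target_w = np.array([1.0, 1.0])   # current target-network weights
online_w = np.array([3.0, -1.0])  # current online-network weights

# One Polyak (soft) update: target <- tau * online + (1 - tau) * target
target_w = tau * online_w + (1 - tau) * target_w
print(target_w)  # [1.01 0.99]
```

Even though the online weights differ substantially (3.0 vs 1.0), the target barely moves, which keeps the TD targets in `update()` nearly stationary between steps.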
def get_actor():
# Initialize weights between -3e-3 and 3e-3
last_init = tf.random_uniform_initializer(minval=-0.003, maxval=0.003)
inputs = layers.Input(shape=(num_states,))
out = layers.Dense(256, activation="relu")(inputs)
out = layers.Dense(256, activation="relu")(out)
outputs = layers.Dense(1, activation="tanh", kernel_initializer=last_init)(out)
# Our upper bound is 2.0 for Pendulum.
outputs = outputs * upper_bound
model = tf.keras.Model(inputs, outputs)
return model
def get_critic():
# State as input
state_input = layers.Input(shape=(num_states,))
state_out = layers.Dense(16, activation="relu")(state_input)
state_out = layers.Dense(32, activation="relu")(state_out)
# Action as input
action_input = layers.Input(shape=(num_actions,))
action_out = layers.Dense(32, activation="relu")(action_input)
# Both are passed through separate layers before concatenating
concat = layers.Concatenate()([state_out, action_out])
out = layers.Dense(256, activation="relu")(concat)
out = layers.Dense(256, activation="relu")(out)
outputs = layers.Dense(1)(out)
# Outputs a single value for a given state-action pair
model = tf.keras.Model([state_input, action_input], outputs)
return model
def policy(state, noise_object):
sampled_actions = tf.squeeze(actor_model(state))
noise = noise_object()
# Adding noise to action
sampled_actions = sampled_actions.numpy() + noise
# We make sure action is within bounds
legal_action = np.clip(sampled_actions, lower_bound, upper_bound)
return [np.squeeze(legal_action)]
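The `OUActionNoise` object used by `policy` is defined earlier in the notebook. For reference, a minimal Ornstein-Uhlenbeck noise sketch matching the constructor call below (`mean`, `std_deviation`); the `theta` and `dt` defaults here are assumptions taken from the common Keras DDPG example, not necessarily the exact values used earlier:

```python
import numpy as np

class OUActionNoise:
    # Ornstein-Uhlenbeck process: temporally correlated noise that
    # gives smoother exploration in continuous action spaces than
    # independent Gaussian noise.
    def __init__(self, mean, std_deviation, theta=0.15, dt=1e-2, x_initial=None):
        self.theta = theta
        self.mean = mean
        self.std_dev = std_deviation
        self.dt = dt
        self.x_initial = x_initial
        self.reset()

    def __call__(self):
        # x_{t+1} = x_t + theta * (mu - x_t) * dt + sigma * sqrt(dt) * N(0, 1)
        x = (
            self.x_prev
            + self.theta * (self.mean - self.x_prev) * self.dt
            + self.std_dev * np.sqrt(self.dt) * np.random.normal(size=self.mean.shape)
        )
        self.x_prev = x  # store so successive samples are correlated
        return x

    def reset(self):
        self.x_prev = self.x_initial if self.x_initial is not None else np.zeros_like(self.mean)

noise = OUActionNoise(mean=np.zeros(1), std_deviation=0.2 * np.ones(1))
print(noise().shape)  # (1,)
```

Because each sample drifts from the previous one rather than being drawn independently, the perturbed torques change gradually, which suits the pendulum's continuous dynamics.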
from tensorflow.keras import layers
std_dev = 0.2
ou_noise = OUActionNoise(mean=np.zeros(1), std_deviation=float(std_dev) * np.ones(1))
actor_model = get_actor()
critic_model = get_critic()
target_actor = get_actor()
target_critic = get_critic()
# Making the weights equal initially
target_actor.set_weights(actor_model.get_weights())
target_critic.set_weights(critic_model.get_weights())
# Learning rate for actor-critic models
critic_lr = 0.002
actor_lr = 0.001
critic_optimizer = tf.keras.optimizers.Adam(critic_lr)
actor_optimizer = tf.keras.optimizers.Adam(actor_lr)
total_episodes = 100
# Discount factor for future rewards
gamma = 0.99
# Used to update target networks
tau = 0.005
buffer = Buffer(50000, 64)
# To store reward history of each episode
ep_reward_list = []
# To store average reward history of last few episodes
avg_reward_list = []
# To store the best average reward observed so far
best_avg_reward = -float('inf')
# Define the path to save the best model
model_save_path = 'Weights/DDPG/Base'
if not os.path.exists(model_save_path):
os.makedirs(model_save_path)
# Takes about 4 min to train
for ep in range(total_episodes):
prev_state = env.reset()
episodic_reward = 0
# Store the training frames
frames = []
while True:
# Render the environment and capture frames for the GIF
frames.append(env.render(mode='rgb_array'))
tf_prev_state = tf.expand_dims(tf.convert_to_tensor(prev_state), 0)
action = policy(tf_prev_state, ou_noise)
# Receive state and reward from environment.
state, reward, done, info = env.step(action)
buffer.record((prev_state, action, reward, state))
episodic_reward += reward
buffer.learn()
update_target(target_actor.variables, actor_model.variables, tau)
update_target(target_critic.variables, critic_model.variables, tau)
# End this episode when `done` is True
if done:
env.close()
break
prev_state = state
ep_reward_list.append(episodic_reward)
# Mean of last 40 episodes
avg_reward = np.mean(ep_reward_list[-40:])
print("Episode * {} * Avg Reward is ==> {}".format(ep, avg_reward))
avg_reward_list.append(avg_reward)
# Directory where you want to save the files
save_dir = 'training_animations/DDPG/Base'
# Create the directory if it doesn't exist
if not os.path.exists(save_dir):
os.makedirs(save_dir)
# Save the best model
if avg_reward > best_avg_reward:
best_avg_reward = avg_reward
# Save the weights of both the actor and critic models
actor_model.save_weights(os.path.join(model_save_path, 'actor_weights.h5'))
critic_model.save_weights(os.path.join(model_save_path, 'critic_weights.h5'))
print("Best weights saved at episode: {}".format(ep + 1))
# Save frames as GIF
gif_filename = os.path.join(save_dir, f'episode_{ep+1}.gif')
imageio.mimsave(gif_filename, frames, duration=0.00333, loop=0)
Episode * 0 * Avg Reward is ==> -1539.2005432588212 Best weights saved at episode: 1 Episode * 1 * Avg Reward is ==> -1551.4147041435936 Episode * 2 * Avg Reward is ==> -1479.3617253261802 Best weights saved at episode: 3 Episode * 3 * Avg Reward is ==> -1442.140613455109 Best weights saved at episode: 4 Episode * 4 * Avg Reward is ==> -1375.9581824614904 Best weights saved at episode: 5 Episode * 5 * Avg Reward is ==> -1419.725227120317 Episode * 6 * Avg Reward is ==> -1470.3477015051521 Episode * 7 * Avg Reward is ==> -1499.1820662443176 Episode * 8 * Avg Reward is ==> -1460.9805425455886 Episode * 9 * Avg Reward is ==> -1473.529132192517 Episode * 10 * Avg Reward is ==> -1464.830294728565 Episode * 11 * Avg Reward is ==> -1428.3362591657126 Episode * 12 * Avg Reward is ==> -1379.5857745240896 Episode * 13 * Avg Reward is ==> -1335.4182287637027 Best weights saved at episode: 14 Episode * 14 * Avg Reward is ==> -1294.1816704414948 Best weights saved at episode: 15 Episode * 15 * Avg Reward is ==> -1308.101272343321 Episode * 16 * Avg Reward is ==> -1282.8530378171915 Best weights saved at episode: 17 Episode * 17 * Avg Reward is ==> -1294.7908443860001 Episode * 18 * Avg Reward is ==> -1280.5643443190077 Best weights saved at episode: 19 Episode * 19 * Avg Reward is ==> -1267.9064120722232 Best weights saved at episode: 20 Episode * 20 * Avg Reward is ==> -1278.7947697289558 Episode * 21 * Avg Reward is ==> -1243.4768376735783 Best weights saved at episode: 22 Episode * 22 * Avg Reward is ==> -1254.75503264108 Episode * 23 * Avg Reward is ==> -1207.7263502857938 Best weights saved at episode: 24 Episode * 24 * Avg Reward is ==> -1164.1783435334196 Best weights saved at episode: 25 Episode * 25 * Avg Reward is ==> -1123.9984111405201 Best weights saved at episode: 26 Episode * 26 * Avg Reward is ==> -1087.0497417457968 Best weights saved at episode: 27 Episode * 27 * Avg Reward is ==> -1070.6751299901562 Best weights saved at episode: 28 Episode * 28 * Avg Reward 
is ==> -1051.0237092680202 Best weights saved at episode: 29 Episode * 29 * Avg Reward is ==> -1020.1696293328009 Best weights saved at episode: 30 Episode * 30 * Avg Reward is ==> -991.0307577015528 Best weights saved at episode: 31 Episode * 31 * Avg Reward is ==> -960.0833757824973 Best weights saved at episode: 32 Episode * 32 * Avg Reward is ==> -934.7879432423568 Best weights saved at episode: 33 Episode * 33 * Avg Reward is ==> -910.8531667679073 Best weights saved at episode: 34 Episode * 34 * Avg Reward is ==> -891.9532818880958 Best weights saved at episode: 35 Episode * 35 * Avg Reward is ==> -870.6478350475421 Best weights saved at episode: 36 Episode * 36 * Avg Reward is ==> -847.271940846694 Best weights saved at episode: 37 Episode * 37 * Avg Reward is ==> -828.4604078551356 Best weights saved at episode: 38 Episode * 38 * Avg Reward is ==> -807.2964324965203 Best weights saved at episode: 39 Episode * 39 * Avg Reward is ==> -787.2188747948796 Best weights saved at episode: 40 Episode * 40 * Avg Reward is ==> -754.3051229917904 Best weights saved at episode: 41 Episode * 41 * Avg Reward is ==> -722.0739997920452 Best weights saved at episode: 42 Episode * 42 * Avg Reward is ==> -691.9288021536662 Best weights saved at episode: 43 Episode * 43 * Avg Reward is ==> -658.7732012837162 Best weights saved at episode: 44 Episode * 44 * Avg Reward is ==> -631.150746331371 Best weights saved at episode: 45 Episode * 45 * Avg Reward is ==> -599.2168427902054 Best weights saved at episode: 46 Episode * 46 * Avg Reward is ==> -558.0287915830792 Best weights saved at episode: 47 Episode * 47 * Avg Reward is ==> -518.6424264212926 Best weights saved at episode: 48 Episode * 48 * Avg Reward is ==> -493.07792518245907 Best weights saved at episode: 49 Episode * 49 * Avg Reward is ==> -456.6486988073152 Best weights saved at episode: 50 Episode * 50 * Avg Reward is ==> -425.20947333325313 Best weights saved at episode: 51 Episode * 51 * Avg Reward is ==> 
-402.73390557256664 Best weights saved at episode: 52 Episode * 52 * Avg Reward is ==> -385.9151707607067 Best weights saved at episode: 53 Episode * 53 * Avg Reward is ==> -367.02086803860755 Best weights saved at episode: 54 Episode * 54 * Avg Reward is ==> -352.324302926947 Best weights saved at episode: 55 Episode * 55 * Avg Reward is ==> -317.4796295666272 Best weights saved at episode: 56 Episode * 56 * Avg Reward is ==> -295.57287187181385 Best weights saved at episode: 57 Episode * 57 * Avg Reward is ==> -261.37425318983986 Best weights saved at episode: 58 Episode * 58 * Avg Reward is ==> -241.91303036738955 Best weights saved at episode: 59 Episode * 59 * Avg Reward is ==> -216.34273637516154 Best weights saved at episode: 60 Episode * 60 * Avg Reward is ==> -181.98902686802882 Best weights saved at episode: 61 Episode * 61 * Avg Reward is ==> -169.52964378086872 Best weights saved at episode: 62 Episode * 62 * Avg Reward is ==> -139.89311259371243 Best weights saved at episode: 63 Episode * 63 * Avg Reward is ==> -139.6465718125118 Best weights saved at episode: 64 Episode * 64 * Avg Reward is ==> -142.70839373468152 Episode * 65 * Avg Reward is ==> -148.3723436941816 Episode * 66 * Avg Reward is ==> -148.24627516005347 Episode * 67 * Avg Reward is ==> -135.49320453406784 Best weights saved at episode: 68 Episode * 68 * Avg Reward is ==> -126.14511136140109 Best weights saved at episode: 69 Episode * 69 * Avg Reward is ==> -126.08400943847747 Best weights saved at episode: 70 Episode * 70 * Avg Reward is ==> -129.1103686176189 Episode * 71 * Avg Reward is ==> -137.0788753305625 Episode * 72 * Avg Reward is ==> -137.01320031680217 Episode * 73 * Avg Reward is ==> -136.87226632576215 Episode * 74 * Avg Reward is ==> -133.6069472385693 Episode * 75 * Avg Reward is ==> -133.52889256156521 Episode * 76 * Avg Reward is ==> -136.56392990205578 Episode * 77 * Avg Reward is ==> -136.4648676894538 Episode * 78 * Avg Reward is ==> -142.41548304125865 Episode * 79 * 
Avg Reward is ==> -145.33096830446038 Episode * 80 * Avg Reward is ==> -145.65327717382146 Episode * 81 * Avg Reward is ==> -138.85629890445767 Episode * 82 * Avg Reward is ==> -141.6481011786871 Episode * 83 * Avg Reward is ==> -147.3504437098939 Episode * 84 * Avg Reward is ==> -150.20451196344598 Episode * 85 * Avg Reward is ==> -144.1634513852633 Episode * 86 * Avg Reward is ==> -144.01674410506095 Episode * 87 * Avg Reward is ==> -140.94156812607622 Episode * 88 * Avg Reward is ==> -140.55901416064447 Episode * 89 * Avg Reward is ==> -140.51692135440388 Episode * 90 * Avg Reward is ==> -150.08228567840018 Episode * 91 * Avg Reward is ==> -149.9028921261845 Episode * 92 * Avg Reward is ==> -152.6202558505941 Episode * 93 * Avg Reward is ==> -155.42204337772856 Episode * 94 * Avg Reward is ==> -155.3788122217969 Episode * 95 * Avg Reward is ==> -155.36686547072946 Episode * 96 * Avg Reward is ==> -158.48140642917332 Episode * 97 * Avg Reward is ==> -158.36345085077414 Episode * 98 * Avg Reward is ==> -161.38545898408566 Episode * 99 * Avg Reward is ==> -164.372993305075
plot_rewards(avg_reward_list)
compute_moving_average_and_plot(avg_reward_list)
GIF of highest reward attempt¶
Results¶
From the GIF, we can see that the model can balance the pendulum effectively. When the pendulum reaches the upright position, the DDPG model is able to maintain this position and prevent the pendulum from falling. This indicates that the model has successfully learned the optimal policy for applying the necessary torque to keep the pendulum balanced.
The graph shows a consistent upward trend in the total reward per episode from the start until around episode 40. This indicates that the model is learning effectively and improving its performance over time.
After episode 40, the reward curve flattens, suggesting that the model has reached a plateau in learning. The fluctuations in the rewards are minimal, indicating stable training without significant volatility.
The moving average graph further confirms the stability of the model's performance. The moving average curve smooths out the episodic fluctuations and provides a clearer view of the overall trend.
There is a steady increase in the moving average until around episode 40, after which it stabilizes, showing that the model has consistently achieved a certain level of performance.
The highest rewards are observed around -200, which suggests that the model is not only balancing the pendulum but also maintaining it with fewer errors.
The range of rewards from -1400 to around -200 shows that the model has significantly improved from the initial episodes where the performance was suboptimal.
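`compute_moving_average_and_plot` is a plotting helper defined earlier in the notebook; the moving average it visualizes can be reproduced in a few lines of NumPy. This sketch assumes the same 40-episode window used by the training loop's `np.mean(ep_reward_list[-40:])`:

```python
import numpy as np

def moving_average(rewards, window=40):
    # Average each value with up to `window - 1` of its predecessors,
    # mirroring np.mean(ep_reward_list[-40:]) in the training loop.
    return [float(np.mean(rewards[max(0, i - window + 1):i + 1]))
            for i in range(len(rewards))]

print(moving_average([-4.0, -2.0, -6.0], window=2))  # [-4.0, -3.0, -4.0]
```

The smoothing is what flattens the episodic spikes in the plot and makes the post-episode-40 plateau visible.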
Running the DDPG weights multiple times (Test the DDPG Model) ¶
# Define the actor model for DDPG
def get_actor(num_states, upper_bound):
last_init = tf.random_uniform_initializer(minval=-0.003, maxval=0.003)
inputs = layers.Input(shape=(num_states,))
out = layers.Dense(256, activation="relu")(inputs)
out = layers.Dense(256, activation="relu")(out)
outputs = layers.Dense(1, activation="tanh", kernel_initializer=last_init)(out)
# Scale the outputs to match the action bounds
outputs = outputs * upper_bound
model = tf.keras.Model(inputs, outputs)
return model
# Load the best actor model
def load_best_actor(model_path, num_states, upper_bound):
model = get_actor(num_states, upper_bound)
# Create a dummy input and call the model to initialize variables
dummy_input = tf.zeros((1, num_states))
model(dummy_input)
# Now load the weights
model.load_weights(model_path)
return model
# Test the loaded model
def test_model(env_string, model, dir_name, num_episodes=5):
env = gym.make(env_string)
rewards = []
for episode in range(num_episodes):
# Save the Frames
frames = []
state = env.reset()
done = False
total_reward = 0
while not done:
state = tf.convert_to_tensor([state], dtype=tf.float32)
action_values = model(state)
action = action_values.numpy()[0]
next_state, reward, done, _ = env.step(action)
# Render the environment and capture frames
frames.append(env.render(mode='rgb_array'))
state = next_state
total_reward += reward
# Directory where you want to save the files
save_dir = f'test_animations/DDPG/{dir_name}'
# Create the directory if it doesn't exist
if not os.path.exists(save_dir):
os.makedirs(save_dir)
# Save frames as GIF
gif_filename = os.path.join(save_dir, f'episode_{episode+1}.gif')
imageio.mimsave(gif_filename, frames, duration=0.00333, loop=0)
print(f"Episode: {episode+1}, Total Reward: {total_reward:.2f}")
rewards.append(total_reward)
env.close()
return rewards
env_string = 'Pendulum-v0'
num_states = 3
upper_bound = 2.0 # Action upper bound for Pendulum
model_path = 'Weights/DDPG/Base/actor_weights.h5'
dir_name = 'Base'
# Load the best actor model and test it
best_actor_model = load_best_actor(model_path, num_states, upper_bound)
rewards = test_model(env_string, best_actor_model, dir_name, num_episodes=20)
Episode: 1, Total Reward: -248.00 Episode: 2, Total Reward: -248.41 Episode: 3, Total Reward: -249.28 Episode: 4, Total Reward: -342.41 Episode: 5, Total Reward: -121.19 Episode: 6, Total Reward: -4.64 Episode: 7, Total Reward: -116.16 Episode: 8, Total Reward: -243.09 Episode: 9, Total Reward: -234.81 Episode: 10, Total Reward: -1.45 Episode: 11, Total Reward: -115.40 Episode: 12, Total Reward: -250.33 Episode: 13, Total Reward: -356.33 Episode: 14, Total Reward: -121.33 Episode: 15, Total Reward: -118.83 Episode: 16, Total Reward: -1.48 Episode: 17, Total Reward: -252.90 Episode: 18, Total Reward: -123.94 Episode: 19, Total Reward: -119.39 Episode: 20, Total Reward: -236.38
plot_rewards(rewards)
display_gifs_in_grid('test_animations/DDPG/Base')
Results¶
The DDPG successfully balances the pendulum in 100% of trials (20/20), with total rewards per trial ranging from about -1.5 to about -356. This suggests that the model is highly robust and effective, achieving consistent control despite variations in reward outcomes. The broad range of rewards indicates that while the model performs reliably, the starting conditions or other factors may affect reward levels. Overall, this performance reflects the model's strong capability and reliability in balancing the pendulum.
Final Model ¶
Configuration of Final DQN Model¶
We have used an Exploration vs Exploitation strategy¶
Uses epsilon to explore diverse actions early on and shifts to exploiting the learned policy as training progresses, preventing suboptimal policies and stabilizing learning.
We have used soft updates instead of hard updates¶
Gradually updates the target network's weights to prevent drastic changes in Q-values, leading to more stable and smooth convergence compared to hard updates.
We have decreased the number of episodes¶
Focuses on fewer episodes to improve learning quality and prevent overfitting, as performance stabilizes within the reduced number of episodes.
We have found the optimal value of tau to be 0.01¶
Ensures smooth updates to the target network, reducing abrupt changes and contributing to stable learning.
We have increased the number of dense neurons¶
Enhances model capacity to capture complex patterns in the state-action space, leading to more accurate Q-function approximation and stable learning.
We have found the optimal number of actions to be 5¶
By optimizing the number of discrete actions to 5, we found a balance that provides sufficient granularity for the agent's decisions while maintaining computational efficiency.
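The five discrete torques this choice maps onto can be checked directly; the sketch below uses the same formula as `action_scope` in the `Agent` class further down:

```python
# Map 5 discrete action indices onto the Pendulum torque range [-2.0, 2.0],
# using the same formula as the Agent's action_scope attribute.
num_actions = 5
action_scope = [i * 4.0 / (num_actions - 1) - 2.0 for i in range(num_actions)]
print(action_scope)  # [-2.0, -1.0, 0.0, 1.0, 2.0]
```

Five evenly spaced torques give the agent a zero-torque option plus moderate and maximal torque in each direction, which is enough granularity to both swing up and hold the pendulum.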
We have found the optimal value for Gamma to be 0.98¶
Provides a good balance between immediate and future rewards, supporting stable and consistent training by avoiding extreme short-term or long-term reward biases.
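One way to see why 0.98 balances immediate and future rewards: a reward k steps ahead is weighted by gamma^k, giving an effective planning horizon of roughly 1/(1 - gamma) = 50 steps. A back-of-envelope check:

```python
gamma = 0.98
# Weight placed on a reward 50 steps in the future
print(round(gamma ** 50, 3))   # 0.364
# Rough effective horizon in steps: 1 / (1 - gamma)
print(round(1 / (1 - gamma)))  # 50
```

A 50-step horizon covers a quarter of a 200-step Pendulum episode, long enough to value the eventual upright position without drowning out the immediate cost of each torque.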
We have found the optimal value of the learning rate to be 0.01¶
Balances the speed and stability of learning, preventing issues from too high or too low rates and aiding in effective convergence.
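The epsilon schedule interacts with these settings: with `epsilon_decay = 0.98` applied once per learning step (as `Agent.update_epsilon` does below), exploration fades quickly. A standalone sketch of how many decay steps it takes epsilon to fall from 1.0 to the floor of 0.01:

```python
epsilon, epsilon_min, epsilon_decay = 1.0, 0.01, 0.98
steps = 0
# Apply the same multiplicative decay used in Agent.update_epsilon
while epsilon > epsilon_min:
    epsilon = max(epsilon_min, epsilon_decay * epsilon)
    steps += 1
print(steps)  # 228 decay steps, i.e. just over one 200-step Pendulum episode
```

This matches the training log below, where epsilon is already 0.03 after episode 1 and pinned at 0.01 from episode 2 onward, so exploration happens almost entirely in the first episode.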
class Memory:
# Constructor, the capacity is the maximum size of the memory. Once capacity is reached, the old memories are removed.
def __init__(self, capacity):
self.memory = deque(maxlen=capacity)
'''
Adds transition to the memory. The transition is a tuple of (state, action, reward, next_state).
If the maximum capacity of memory is reached, the old memories are overwritten.
'''
def update(self, transition):
self.memory.append(transition)
'''
Retrieve a random sample from memory, batch size indicates the number of samples to be retrieved.
'''
def sample(self, batch_size):
return random.sample(self.memory, batch_size)
class Net(tf.keras.Model):
# constructor initializes the layers
def __init__(self, input_size, num_actions):
super(Net, self).__init__()
self.dense1 = Dense(128, activation='relu', input_shape=(input_size,))
self.dense2 = Dense(128, activation='relu')
self.output_layer = Dense(num_actions, activation='linear')
'''
This function is the forward pass of the model. It takes the state as input and returns the Q values for each action.
'x' is the input to the model, which is the state from the environment. 'x' is passed through the two
ReLU dense layers and then the output layer. The output layer uses a linear activation (not ReLU) so the
predicted action values can take any real value, positive or negative, as Q-value estimation requires.
'''
def call(self, x):
# first dense layer
x = self.dense1(x)
# second dense layer
x = self.dense2(x)
return self.output_layer(x)
class Agent:
# this constructor initializes the environment, model, memory, and other variables required for the agent
def __init__(self, env_string, num_actions, state_size, batch_size=32, learning_rate=0.01, gamma=0.98, epsilon=1.0, epsilon_decay=0.98, epsilon_min=0.01, tau=0.01, memory_capacity=10000):
self.env_string = env_string
self.env = gym.make(env_string)
self.env.reset()
self.state_size = state_size
self.num_actions = num_actions
self.action_scope = [i * 4.0 / (num_actions - 1) - 2.0 for i in range(num_actions)] # Adjusted to match action scaling in dueling DQN
self.batch_size = batch_size
self.gamma = gamma
self.epsilon = epsilon
self.epsilon_decay = epsilon_decay
self.epsilon_min = epsilon_min
self.tau = tau
self.memory = Memory(memory_capacity)
self.best_total_reward = float('-inf')
self.model = Net(state_size, num_actions)
self.target_model = Net(state_size, num_actions)
self.optimizer = Adam(learning_rate)
self.loss_fn = tf.losses.MeanSquaredError()
'''
Implements the epsilon-greedy policy. With probability epsilon, a random action is selected, otherwise the action chosen will
be the one with the highest Q value. The epsilon value is decayed over time to reduce the exploration as the agent learns.
It also converts the state to a tensor before passing through to the model.
'''
def select_action(self, state):
# if random number is less than epsilon, return random action
if np.random.rand() < self.epsilon:
return np.random.randint(self.num_actions), None
# convert to tensor
state = tf.convert_to_tensor([state], dtype=tf.float32)
q_values = self.model(state)
# else, return action with highest Q value
return np.argmax(q_values.numpy()), None
# Store experience tuple into the memory buffer, function is above
def store_transition(self, state, action, reward, next_state):
self.memory.update((state, action, reward, next_state, False))
'''
Learn function performs single step training on a batch of experiences sampled from the memory buffer.
'''
def learn(self):
if len(self.memory.memory) < self.batch_size:
return
# sample a batch of transitions from the memory
transitions = self.memory.sample(self.batch_size)
# extract the states, actions, rewards, next states, from the batch
state_batch, action_batch, reward_batch, next_state_batch, _ = zip(*transitions)
# convert the s, a, r, s_' to tensors
state_batch = tf.convert_to_tensor(state_batch, dtype=tf.float32)
action_batch = tf.convert_to_tensor(action_batch, dtype=tf.int32)
reward_batch = tf.convert_to_tensor(reward_batch, dtype=tf.float32)
next_state_batch = tf.convert_to_tensor(next_state_batch, dtype=tf.float32)
'''
TensorFlow GradientTape is an API for automatic differentiation: it records the operations executed
inside its context so that gradients can be computed afterwards. The block below calculates the
predicted Q-values for the current states and actions.
'''
with tf.GradientTape() as tape:
'''
Forward pass through the DQN model to get the Q-values for all actions given the current batch of states.
q-values contains the predicted q-values for each action in each state of the batch
'''
q_values = self.model(state_batch)
'''
'action_indices' calculates the indices of the actions taken in the Q-value matrix.
Since q_values contains Q-values for all actions, this step is necessary to select only the Q-values corresponding to the actions that were actually taken.
'''
action_indices = tf.range(self.batch_size) * self.num_actions + action_batch
'''
Reshapes the Q-value matrix to a single vector, then selects the Q-values for the actions taken using the 'action_indices' calculated above.
'''
predicted_q = tf.gather(tf.reshape(q_values, [-1]), action_indices)
'''
Forward pass through the target DQN model to get Q-values for all actions given the next states.
'''
next_q_values = self.target_model(next_state_batch)
# Finds the maximum Q-value among all actions for each next state, which represents the best possible future reward achievable from the next state.
max_next_q = tf.reduce_max(next_q_values, axis=1)
'''
Calculates the target Q-values using the immediate reward received (reward_batch) and the
discounted maximum future reward (self.gamma * max_next_q). This forms the update target for the Q-value of the action taken.
'''
target_q = reward_batch + self.gamma * max_next_q
# Calculates the loss between the predicted Q-values and the target Q-values, basically MSE loss
loss = self.loss_fn(target_q, predicted_q)
# Calculate the gradients of the loss with respect to the model parameters
grads = tape.gradient(loss, self.model.trainable_variables)
self.optimizer.apply_gradients(zip(grads, self.model.trainable_variables))
self.update_epsilon()
# update the epsilon value
def update_epsilon(self):
self.epsilon = max(self.epsilon_min, self.epsilon_decay * self.epsilon)
# soft-update the target network weights toward the online network (Polyak averaging)
def update_target_model(self):
main_weights = self.model.get_weights()
target_weights = self.target_model.get_weights()
for i in range(len(target_weights)):
target_weights[i] = self.tau * main_weights[i] + (1 - self.tau) * target_weights[i]
self.target_model.set_weights(target_weights)
# checks if current model is the best model based on total reward
def is_best_model(self, total_reward):
return total_reward > self.best_total_reward
# updates the best model with the current model
def update_best_model(self, total_reward):
self.best_total_reward = total_reward
# Main function to run the training
def main():
env_string = 'Pendulum-v0'
num_actions = 5
state_size = gym.make(env_string).observation_space.shape[0]
agent = Agent(env_string, num_actions, state_size)
episodes = 25
rewards = [] # Initialize rewards list to store total rewards per episode
for ep in range(episodes):
state = agent.env.reset()
done = False
total_reward = 0
# Capture frames for GIF
frames = []
while not done:
# Render the environment and capture frames
frames.append(agent.env.render(mode='rgb_array'))
action_index, _ = agent.select_action(state)
action = [agent.action_scope[action_index]]
next_state, reward, done, _ = agent.env.step(action)
agent.store_transition(state, action_index, reward, next_state)
state = next_state
total_reward += reward
agent.learn()
agent.update_target_model()
rewards.append(total_reward) # Append the total reward of the episode to the rewards list
print(f"Episode: {ep+1}, Total Reward: {total_reward:.2f}, Epsilon: {agent.epsilon:.2f}")
if agent.is_best_model(total_reward):
# Update and save the best model weights
agent.update_best_model(total_reward)
model_dir = 'FinalRLWeights'
save_model_weights(agent, model_dir)
print("Best model weights saved.")
# Directory where you want to save the files
save_dir = 'training_animations/Final'
# Create the directory if it doesn't exist
if not os.path.exists(save_dir):
os.makedirs(save_dir)
# Save frames as GIF
gif_filename = os.path.join(save_dir, f'episode_{ep+1}.gif')
imageio.mimsave(gif_filename, frames, duration=0.00333, loop=0) # Adjust duration as needed
agent.env.close()
return rewards
rewards = main()
Episode: 1, Total Reward: -1313.97, Epsilon: 0.03 Best model weights saved. Episode: 2, Total Reward: -1752.51, Epsilon: 0.01 Episode: 3, Total Reward: -1561.45, Epsilon: 0.01 Episode: 4, Total Reward: -1198.90, Epsilon: 0.01 Best model weights saved. Episode: 5, Total Reward: -948.70, Epsilon: 0.01 Best model weights saved. Episode: 6, Total Reward: -1139.11, Epsilon: 0.01 Episode: 7, Total Reward: -1367.59, Epsilon: 0.01 Episode: 8, Total Reward: -359.80, Epsilon: 0.01 Best model weights saved. Episode: 9, Total Reward: -2.20, Epsilon: 0.01 Best model weights saved. Episode: 10, Total Reward: -394.12, Epsilon: 0.01 Episode: 11, Total Reward: -1.27, Epsilon: 0.01 Best model weights saved. Episode: 12, Total Reward: -239.96, Epsilon: 0.01 Episode: 13, Total Reward: -248.52, Epsilon: 0.01 Episode: 14, Total Reward: -127.38, Epsilon: 0.01 Episode: 15, Total Reward: -126.39, Epsilon: 0.01 Episode: 16, Total Reward: -132.15, Epsilon: 0.01 Episode: 17, Total Reward: -126.19, Epsilon: 0.01 Episode: 18, Total Reward: -259.38, Epsilon: 0.01 Episode: 19, Total Reward: -635.26, Epsilon: 0.01 Episode: 20, Total Reward: -67.58, Epsilon: 0.01 Episode: 21, Total Reward: -346.39, Epsilon: 0.01 Episode: 22, Total Reward: -122.10, Epsilon: 0.01 Episode: 23, Total Reward: -130.27, Epsilon: 0.01 Episode: 24, Total Reward: -125.42, Epsilon: 0.01 Episode: 25, Total Reward: -2.76, Epsilon: 0.01
Final Model Evaluation ¶
Rewards ¶
plot_rewards(rewards)
Results¶
The graph shows a positive trend in total rewards as the number of episodes increases. Initially, the rewards are around -1650, but they improve over time, reaching approximately -3 by the end of the 25 episodes. This suggests that the model is learning and achieving higher rewards. Overall, the rewards earned seem to increase, indicating successful learning.
While there are fluctuations in reward from episode to episode, the curve generally smooths out towards later episodes. This suggests that the model is stabilizing its performance. The fluctuations in reward from episode to episode suggest that the model is exploring different actions and strategies. This exploration is a critical part of the reinforcement learning process, as it allows the model to discover potentially better strategies rather than sticking to known, suboptimal ones. Overall, the model shows great stability; when the curve spikes, the fluctuations are not extreme, suggesting a stable learning process.
In terms of learning speed, the model is able to find the optimal rewards (close to 0) in around 8 episodes.
To conclude, the model achieves a good balance between learning speed and stability, consistently reaching optimal rewards while maintaining good stability.
Moving Average ¶
compute_moving_average_and_plot(rewards)
Results¶
The moving average curve shows a clear upward trend, indicating that the model is generally improving over time. This positive trend suggests that the agent is learning more effective strategies and consistently achieving higher rewards.
While there are some minor fluctuations, the overall movement of the curve is smooth. This smoothness suggests that the model is not experiencing large swings in performance, which is a good indication of stability. The absence of large deviations or volatility in the curve implies that the learning process is steady, and the agent is refining its policies without frequent drastic changes. This consistent improvement and stability are key indicators of a successful training process.
GIF of highest reward attempt ¶
Results¶
As observed in the GIF, the Deep Q-Network (DQN) effectively balances the pendulum. When the pendulum reaches the upright position, the DQN accurately selects and applies the necessary control actions to stabilize it, preventing it from falling. This shows that the DQN has successfully learned to approximate the optimal action-value function, allowing it to apply the correct torques to maintain the pendulum in an upright position.
Running the best weights multiple times (Test the Final Model) ¶
if __name__ == '__main__':
env_string = 'Pendulum-v0'
input_size = gym.make(env_string).observation_space.shape[0]
num_actions = 5 # Must match the number used during training
model_dir = 'FinalRLWeights'
weights_path = os.path.join(model_dir, 'tensorflow_dqn_weights.h5')
# Ensure the model path exists
if os.path.exists(weights_path):
model = load_model(weights_path, input_size, num_actions)
rewards = test_model(env_string, model,'Final', 20)
else:
print(f"Model weights not found in {weights_path}. Please ensure the correct path.")
Episode: 1, Total Reward: -247.48 Episode: 2, Total Reward: -126.97 Episode: 3, Total Reward: -117.97 Episode: 4, Total Reward: -123.70 Episode: 5, Total Reward: -123.71 Episode: 6, Total Reward: -126.20 Episode: 7, Total Reward: -116.97 Episode: 8, Total Reward: -122.62 Episode: 9, Total Reward: -128.90 Episode: 10, Total Reward: -1.55 Episode: 11, Total Reward: -1.65 Episode: 12, Total Reward: -123.04 Episode: 13, Total Reward: -125.70 Episode: 14, Total Reward: -247.52 Episode: 15, Total Reward: -120.44 Episode: 16, Total Reward: -119.64 Episode: 17, Total Reward: -127.21 Episode: 18, Total Reward: -135.07 Episode: 19, Total Reward: -1.53 Episode: 20, Total Reward: -132.20
plot_rewards(rewards)
display_gifs_in_grid('test_animations/Final')
Results¶
After running the best model 20 times, it balances the pendulum in every trial. This suggests the learned policy is robust, well-generalized, and effective across repeated runs, and that the model has converged to a near-optimal solution for balancing the pendulum. The consistent success across all 20 runs also shows that the balancing is not due to luck, but the result of effective learning and reliable, precise control decisions in the face of the environment's dynamics.
We can also see that across these 20 trials, the model's rewards range from close to 0 to around -250. Rewards close to 0 indicate trials where the pendulum started near the upright position and stayed balanced for almost the entire episode. Rewards in the -120 to -250 range occur when the pendulum's starting position is randomly initialized away from upright, so the model accumulates negative reward from the outset while swinging the pendulum up. Despite these lower totals, the model still balances the pendulum effectively once it reaches the upright position.
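A quick way to summarize the 20 printed trials is to post-process the reward list directly. The numbers below are copied from the output above; the -10 cutoff for "started near upright" is an arbitrary illustrative threshold:

```python
# Rewards copied from the 20 test episodes printed above
rewards = [-247.48, -126.97, -117.97, -123.70, -123.71, -126.20, -116.97,
           -122.62, -128.90, -1.55, -1.65, -123.04, -125.70, -247.52,
           -120.44, -119.64, -127.21, -135.07, -1.53, -132.20]

best, worst = max(rewards), min(rewards)
mean_reward = sum(rewards) / len(rewards)
# Trials with near-zero reward started close to upright;
# -10 is an arbitrary cutoff chosen for illustration
near_upright_starts = sum(1 for r in rewards if r > -10)
```

This makes the spread explicit: the worst trial is -247.52, the best is -1.53, and 3 of the 20 episodes began close enough to upright to finish with near-zero reward.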
Final Model Verdict ¶
The Deep Q-Network (DQN) model for balancing a pendulum reveals a positive trend in its performance. The graph of total rewards shows an improvement over time, with initial rewards around -1650 gradually increasing to approximately -3 by the end of 25 episodes. This upward trend indicates that the model is learning effectively and achieving higher rewards.
Despite some fluctuations in reward from episode to episode, the curve generally smooths out towards the later episodes, suggesting that the model is stabilizing its performance. These fluctuations are a normal part of the reinforcement learning process, reflecting the model's exploration of different actions and strategies, which helps it discover better policies and avoid suboptimal ones. Overall, the fluctuations are not extreme, indicating a stable learning process.
In terms of learning speed, the model finds the optimal rewards (close to 0) in approximately 8 episodes, achieving a good balance between learning speed and stability. The moving average curve further supports this, showing a clear upward trend that suggests consistent improvement and effective learning over time.
As observed in the GIF, the DQN effectively balances the pendulum by accurately selecting and applying the necessary control actions to maintain the pendulum in the upright position. This demonstrates that the model has successfully learned to approximate the optimal action-value function, allowing it to handle the dynamics of the environment effectively.
After running the best model 20 times, it consistently balances the pendulum 100% of the time, reflecting a high level of robustness and reliability. This consistent performance indicates that the model has converged to an optimal solution and is not reliant on luck. Even though rewards in some trials range from close to 0 to around -250, likely due to random starting positions, the model still demonstrates effective balancing capabilities. This suggests that the model's learning and decision-making are both robust and reliable, ensuring stable performance despite varying initial conditions.
Conclusions ¶
We have developed a Deep Q-Network (DQN) model that consistently balances the pendulum. The development process began with a basic DQN model, followed by the implementation of exploration versus exploitation strategies. We then transitioned from hard updates to soft updates, reduced the number of episodes, and optimized the tau parameter. Next, we adjusted the number of dense neurons and determined the optimal number of actions. We also fine-tuned the gamma parameter and identified the ideal learning rate. With all these improvements, we successfully created the best DQN model for balancing the pendulum.
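The soft-update step mentioned above (Polyak averaging controlled by tau) can be sketched as follows. This is a generic illustration rather than the notebook's implementation, and tau=0.1 is chosen for the example rather than being the tuned value:

```python
import numpy as np

def soft_update(target_weights, online_weights, tau):
    """Blend the online network's weights into the target network.

    tau near 0 moves the target slowly (stable targets);
    tau = 1 reproduces a hard update (full copy).
    """
    return [tau * w_o + (1.0 - tau) * w_t
            for w_t, w_o in zip(target_weights, online_weights)]

# Tiny numeric example with a single "layer" of weights
target = [np.array([1.0, 1.0])]
online = [np.array([2.0, 0.0])]
updated = soft_update(target, online, tau=0.1)
# updated[0] == [0.1*2 + 0.9*1, 0.1*0 + 0.9*1] == [1.1, 0.9]
```

Applying this small blend after every training step, instead of copying all weights at fixed intervals, is what makes the target values drift smoothly rather than jump.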
We have also explored different reinforcement learning algorithms beyond DQN, including Double DQN, Dueling DQN, and DDPG. For the Double DQN model, we investigated methods such as gradient clipping and prioritized experience replay. In the Dueling DQN model, we explored techniques like the Boltzmann exploration policy.
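The Boltzmann exploration policy mentioned for the Dueling DQN samples actions in proportion to exp(Q / T). A minimal sketch, where the temperature and Q-values are illustrative rather than taken from the notebook:

```python
import numpy as np

def boltzmann_action(q_values, temperature=1.0):
    """Sample an action with probability proportional to exp(Q / T).

    High temperature -> near-uniform exploration;
    low temperature -> near-greedy selection.
    Subtracting the max Q keeps the exponentials numerically stable.
    """
    q = np.asarray(q_values, dtype=float)
    prefs = np.exp((q - q.max()) / temperature)
    probs = prefs / prefs.sum()
    return np.random.choice(len(q), p=probs), probs

action, probs = boltzmann_action([1.0, 2.0, 3.0], temperature=0.5)
# Higher Q-values get higher selection probability, but every action
# retains a nonzero chance of being explored
```

Unlike epsilon-greedy, which explores uniformly at random, this weights exploration toward actions the network already believes are promising.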
#⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢀⣀⣤⡤⠴⠶⣦⣤⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
#⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣀⣴⡾⠛⠉⠀⠀⠀⠀⠀⠈⠙⢿⣦⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
#⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣠⣾⠟⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢻⣧⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
#⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢠⣾⠟⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⢸⣿⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
#⠀⠀⢀⣠⣴⣶⣶⣶⣶⣶⣤⣤⣄⣀⣀⡀⢀⣴⡿⠋⠀⠀⢀⡀⠰⣾⣆⠀⠀⠀⠀⠀⠀⠀⢀⣾⠏⠀⢠⣤⣀⠀⠀⠀⠀⠀⠀
#⢀⣴⠛⠉⠀⠀⠀⠀⠀⠈⠉⠉⢉⣿⠟⢻⣿⡿⠿⠿⠿⠟⠋⠀⠀⠙⠿⣷⣤⣀⡀⠀⣀⣤⠿⠃⠀⣠⣿⡟⠁⠀⠀⠀⠀⠀⠀
#⣸⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⣰⣿⠋⢠⣾⠟⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠉⠉⠙⠉⠉⠁⠀⠀⣰⣿⠟⠀⠀⠀⠀⠀⠀⠀⠀
#⢹⣧⡀⠀⠀⠀⠀⠀⠀⠀⣰⣿⠇⢠⣿⠋⣀⣤⡀⠀⠀⢀⣀⣀⣀⡀⠀⢀⣤⡄⢀⣠⣤⡄⠀⣰⣿⠋⣠⣶⣶⡄⠀⠀⠀⠀⠀
#⠀⠻⠿⣿⣶⠀⠀⠀⠀⣰⣿⡏⢠⣿⣷⣾⣿⡿⠃⣠⣾⠟⢉⣿⡟⢁⣴⠟⣽⣿⠟⣹⣿⠇⣰⣿⣿⣟⣁⣴⡿⠁⠀⠀⠀⠀⣄
#⠀⠀⠀⠀⠀⠀⠀⠀⣰⣿⡟⢀⣿⡿⠋⣾⡟⢁⣴⣿⣋⣠⣿⡏⣰⡿⢁⣼⡿⠃⢠⣿⣧⣾⣿⡿⢻⣿⡉⠁⠀⠀⠀⠀⠀⢻⣿
#⠀⠀⠀⠀⠀⠀⠀⣰⣿⡿⠀⠈⠻⠁⠀⠿⠿⠛⠙⠻⠛⠁⠻⠟⠋⠀⠈⠛⠁⠀⠘⠿⠛⠹⣿⠁⠀⠻⣿⣄⠀⠀⠀⠀⠀⢸⣿
#⠀⣠⡀⠀⠀⠀⢠⣿⡟⠁⠀⠀⠀⠀⠀⠀⠀⣠⠀⠀⠀⢀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠙⢿⣦⣄⡀⢀⣠⡿⠃
#⢸⡟⠁⠀⠀⣰⡿⠋⠀⠀⠀⠀⠀⠀⠀⢠⣿⠏⠀⢠⣿⠟⣠⣴⡾⣿⣿⠂⠀⣤⣤⠆⠀⣤⣤⡆⠀⠀⠀⠀⠈⠉⠛⠋⠉⠀⠀
#⠘⠷⠤⠴⠟⠋⠀⠀⠀⠀⠀⠀⠀⠀⢠⣿⠏⠀⣴⣿⣟⣼⡿⠋⢰⣿⣿⢃⣾⡿⠁⢀⣾⣿⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
#⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⣾⡟⢠⣾⣿⡏⢸⣿⠁⠀⣼⣿⠟⣿⡟⠀⣰⣿⡿⠁⠀⠀⣀⣀⠀⠀⠀⠀⠀⠀⠀⠀⠀
#⠀⠀⠀⠀⠀⢀⣀⣀⣀⣀⡀⠀⠀⠀⠻⠿⢫⣿⠟⠀⠸⣿⣴⣾⠟⠁⠀⠻⣷⠟⠋⢸⡇⠀⢀⣴⣿⠟⠀⠀⠀⠀⠀⠀⠀⠀⠀
#⠀⠀⢀⣴⠟⠛⠉⠛⠛⠛⠿⠿⣿⣶⣶⣤⣿⣏⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠻⠶⠟⠛⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
#⠀⠀⢸⡇⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⣹⣿⠛⠿⢿⣷⣶⣤⣀⡀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
#⠀⠀⠘⣷⡄⠀⠀⠀⠀⠀⠀⠀⢀⣾⡟⠁⠀⠀⠀⠀⠉⠙⠛⠻⠿⣶⣶⣤⣤⣀⣀⣀⣀⣀⣀⣀⣤⠾⠀⠀⠀⠀⠀⠀⠀⠀⠀
#⠀⠀⠀⠙⢿⣦⡀⠀⠀⠀⢀⣴⠿⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠈⠉⠛⠛⠛⠿⠿⠛⠛⠉⠁⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
#⠀⠀⠀⠀⢀⣉⣛⠓⠒⠚⠋⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀⠀
Thank You for reading this notebook, have a great day ahead!